Chapter 1 Introduction

Welcome to our introduction to data science. This book will walk you through theory, technical details, code, and statistical output. We have tried to write this book to cut out the unnecessary details often found in many texts, but to go substantially deeper than traditional data science books, which tend to focus only on code and the application of techniques.

1.1 How to Use this Book

The text will contain theory, general notes, technical notes, code, and output.

We decided to write this text to balance the theoretical and practical applications of analytics. As both practitioners and students of analytics, we believe that all of the material contained in this book will benefit those interested in analytics. For those interested in a deeper understanding, the technical notes will give more details on what is happening ‘behind the scenes’ of our models. At times, we will also include extra information in highlighted text boxes.

Once theory is discussed, we will move on to practice. Each technical chapter will contain code and its corresponding output. It is worth noting that this text is geared toward the practical application of analytics, and may sometimes simplify technical details for the sake of practicality.

To get started, you will need Anaconda. Anaconda contains both Python and the IDE (Integrated Development Environment) Jupyter, which is our preference for Python. Most of the packages you will need are pre-installed with Anaconda, but we will let you know when you need something new.

As you might suspect, you will need some data to follow along. Below, you can download an Excel file which contains the data we will use throughout the book. In addition, there will be some duplicate data stored in a CSV. Different data formats have different characteristics, so some sections will encourage you to use the CSV version instead.

Download From_Data_to_Insight_DataSets.xlsx
Download grocery_data.csv

Before we get started, it is worth defining what exactly we are covering.

1.2 What is Analytics?

We would be willing to bet that if you asked a dozen different data scientists “What is analytics?” you would get at least a dozen different answers – most riddled with ‘gradient boosted’ this or ‘scikit-learn’ that.

For our purposes, analytics means using data to gain insight or drive action. That means you don’t need machine learning or artificial intelligence; sometimes, you just need a great chart. One of the most impactful analytics use cases we have encountered culminated in a single slide with one visualization. The insight from that one slide drove over $40 million in value to the business.

In this text, we will divide analytics into four categories.

1.2.1 Descriptive

What happened?

Descriptive analytics is based on describing data. Usually this means summarizing raw data and transforming it into an easily understandable format. It is important to note that descriptive analytics is based on describing the past.

It often manifests in reporting, dashboards, and data visualizations. Common examples include financials, sales, or operations reports.

1.2.2 Diagnostic

Why did it happen?

Diagnostic analytics is still looking backward, but now trying to understand historical results instead of just reporting on them. Unfortunately, most of today’s Business Intelligence (BI) tools stop at descriptive analytics without providing the means to answer why something happened.

Diagnostic analytics often involves statistical analysis, data mining, or anomaly detection techniques to dig deeper on ‘why’.

1.2.3 Predictive

What could happen?

Instead of looking backward, predictive analytics begins to look forward. Here we begin to use predictive models such as regression, decision trees, or other machine learning models to predict events or outcomes.

Predictive analytics is commonly used as part of forecasting, risk management, and marketing.

1.2.4 Prescriptive

What should be done?

Prescriptive analytics takes predictive models to the next level and makes recommendations as to the best action to take. While prescriptive analytics systems typically start out as supervised systems, they are often developed to work as fully automated systems.

Prescriptive analytics is often used in fraud modelling (declined credit card transactions, specifically), inventory optimization, or marketing Next Best Action models.

Here is how Gartner breaks out the types of analytics.


Figure 1.1: Types of Analytics

1.3 Artificial Intelligence (AI) & Machine Learning (ML)

So, what is the difference between analytics, machine learning, and AI? We tend to think of the different types of analytics in this order, from broadest to most specific:

  1. Analytics
  2. AI
  3. Machine Learning
  4. Deep Learning

1.4 Artificial Intelligence

We think of AI as being a blanket term meaning any time a machine tries to mimic a human. In the broadest sense, this means any human activity – making predictions, judgement calls, understanding sensory input (visuals, auditory signals), etc.

Ultimately, one of the goals of AI is to achieve general intelligence: a machine that can learn and think by itself. We are inching closer to that ability, as evidenced by Google’s DeepMind and its successes with AlphaGo and StarCraft. Other examples include chatbots and smart assistants like Siri, Cortana, or Google Now.

While the current state of the art in AI tools is not a generalized intelligence, they are still an important tool in analytics.

1.5 Machine Learning

Machine learning is a specific technique under the AI umbrella. It is used to take large amounts of data and produce mathematical models that describe that data. Machine learning is commonly used now, but has been around since the 1950s. It is only with the recent increase in computational capacity and the availability of data that we have really been able to take advantage of machine learning.

Contained within machine learning are supervised learning (predictive models, like regression or classification models) and unsupervised learning (like clustering or anomaly detection). Supervised techniques are those where a sample dataset is provided that contains both the input and output data for the model. For example, a supervised technique for predicting fraud would be given data on transactions along with a label indicating whether or not each transaction was recognized as fraud. Unsupervised techniques are used when there is no data on the target characteristic. An unsupervised technique for fraud detection would be presented with transaction data without any fraud labels, and the system would be tasked with identifying different patterns within the data.
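The supervised/unsupervised distinction above can be sketched in a few lines of scikit-learn. This is a toy illustration with invented data, not one of the book's datasets: the "fraud labels" are simulated so the supervised model has something to learn.

```python
# Toy sketch of supervised vs. unsupervised learning (invented data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))             # two features per "transaction"
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # simulated fraud labels

# Supervised: both the inputs X and the known outcomes y are provided.
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))                    # accuracy on the training data

# Unsupervised: only X is provided; the model looks for patterns itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                     # cluster assignments, not labels
```

Note that the clusters found by KMeans need not correspond to the fraud labels at all; unsupervised methods find structure, not answers to a pre-defined question.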

Machine learning examples are widespread; almost anytime you build a predictive model, you are using machine learning. Marketing, risk, operations, and many other areas of large organizations are using machine learning regularly. More advanced models are available, but often not necessary.

1.6 Deep Learning

Deep learning is a more advanced machine learning technique often involving neural network methods. It is typically used for very complex problems, like understanding audio, video, and other unstructured data. The ‘deep’ in deep learning refers to the number of layers in the model, effectively representing feature transformations. We will not delve deeply into deep learning here.

1.7 Analytics Myths

Myth 1: Machine learning means machines are learning on their own

We still do not need to worry about Skynet, HAL 9000, or any other general intelligence taking over the world (yet). State-of-the-art machine learning is just a set of techniques for solving problems using large amounts of data, and those techniques are still very much human driven.

Myth 2: AI & ML are the solution to everything

You do not need advanced techniques to drive value with analytics. Sometimes, all it takes is some basic analysis and a great story. Other times, machine learning is not an appropriate technique because there is considerable information about the problem that is not easily extracted from the data but can be imposed on a model by an analyst. For example, in cases of collinearity (where two or more variables exhibit similar trends), machine learning models cannot distinguish between them.

Myth 3: Analytics is synonymous with machine learning

Analytics includes a whole host of techniques, from simple visualizations through to the most advanced neural network. A good data scientist uses the best tool for the job, not the most complicated one.

Chapter 2 Data Science Technology

2.1 Data Science Tools

While there are dozens of data science tools available, there are three free languages that will serve you well throughout your data science career: Structured Query Language (SQL) for data acquisition, and Python and R for analysis. In this text, we will spend our time in Python (via Jupyter) and learn how to code in SQL.

2.1.1 Python

Python is an open-source, general purpose programming language with an emphasis on code readability and a strong set of data science tools. KDNuggets, a leading blog on data science and AI, suggests that Python has now become the most popular data science tool (Piatetsky 2018). One of Python’s most notable packages, scikit-learn, is the go-to package for machine learning applications. There are a number of popular IDEs for Python; our favourite is Jupyter as part of the Anaconda distribution, though there are many others.

2.1.2 R

At the highest level, R is a programming language focused on statistical computing and graphics. It is helpful to think of R as a free, open-source alternative to other statistical computing software such as SAS, SPSS, and Stata. A large part of R’s success is its open source nature, which has allowed programmers to create and publish new packages to provide more and more complex analyses. Given its price (free!) and the supportive community of practice, all would-be data scientists should have at least a passing understanding of R.

2.1.3 RStudio

RStudio is an Integrated Development Environment (IDE) and is the graphical interface to R. While you can use R on its own via command line, RStudio will make it much easier to produce and save all your data science work.

2.1.4 SQL (Structured Query Language)

SQL is a programming language specifically designed to get data in and out of relational databases. It has been the standard database language since 1986 and is still used today, both for relational databases and a number of Big Data tools. Though its standard dates to the 1980s, SQL remains instrumental in accessing the world’s data. SQL is not its own software package, but rather a language used by Database Management Systems (DBMSs).

There are a few commonly used DBMSs. We do not believe they are strongly differentiated, though there are occasional syntax and feature differences. The most popular are MySQL, MS SQL, Oracle SQL, PostgreSQL, and IBM DB2.
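Because SQLite ships with Python's standard library, you can experiment with SQL syntax without installing any of the DBMSs above. The table and values below are invented for illustration; the query itself is standard SQL that would run, with at most minor tweaks, on any of those systems.

```python
# A standard SQL query run against an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("Downtown", 120.0), ("Downtown", 80.0), ("Airport", 50.0)])

# An aggregate query: total sales per store, ordered alphabetically.
rows = con.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(rows)  # [('Airport', 50.0), ('Downtown', 200.0)]
```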

2.1.5 Summary

Fortunately, all these technologies are available on any platform you choose - Windows, Mac OS, or Linux.

The above tools are certainly not an exhaustive list. SAS, SPSS, Stata, and MatLab are still commonly used, however, we will not be discussing them in detail.

In Chapter 5 we will discuss Big Data tools, including Hadoop and its accompanying software packages.

2.2 Self-Serve Data Science

As data science, analytics, and AI become more commonplace in industry, ‘self-serve’ data science tools have become increasingly popular. These tools require no code and aim to democratize data science by allowing anyone to create a predictive model with minimal technical background. While it may sound self-serving, we are skeptical: we believe these tools should not be used as liberally as their advocates recommend, and that they can produce dangerous, even career-ending, decisions if used improperly. In later chapters, we will discuss the different use cases for predictive models and how, in certain cases, improperly built models can create significant issues.

2.2.1 RapidMiner

RapidMiner is an all-in-one data science IDE. It facilitates data acquisition, preparation, visualization, and predictive modelling, all in an easy-to-use visual interface. The interface is built around a process flow in which you drag and drop data, data transformations, and models into a process diagram. When run, the process executes in order, allowing for consistent data transformations before your model.

2.2.2 Alteryx

Alteryx facilitates many of the same techniques as RapidMiner but is more focused on a repeatable data preparation process rather than advanced analytics. It uses the same process flow diagram and can connect to a wide variety of data sources including Office, SAS, relational databases, and Big Data platforms.

2.2.3 DataRobot

DataRobot is an advanced analytics modelling tool. It has limited data preparation features and instead requires a nicely formatted dataset to begin work. Once you have data ready, you provide a target variable to DataRobot and it will automatically try hundreds of different models, eventually selecting the most accurate one. Once the model has been selected, DataRobot provides model evaluation metrics, feature importance information, and visualizations.

2.2.4 Trifacta

Trifacta represents the opposite end of the spectrum from DataRobot – it only does data wrangling, the process of taking raw data and turning it into something usable for analytics. Trifacta supports connections to multiple data sources, handles large data, and allows users of all skill levels to clean and enrich corporate data without any code.

2.3 Tools in This Book

Make sure you have Anaconda installed and a Jupyter notebook ready to go. You can also consider Tableau; sign up for the one-year student trial or use Tableau Public to get started. This book is produced in R Markdown, which allows us to embed real analysis and output from Python and SQL alongside theory. Python output will look like this:

2 + 2
## 4
print("Hello, world!")
## Hello, world!

You will also see package names denoted the same way, but inline with other text: NumPy.

We encourage you to follow along with our analysis in your own instance of Python and Jupyter. We have provided the series of datasets that will be used throughout the book.

2.4 General Note

Ultimately, the purpose of this book is to help you understand the data science process. The technical side of analysis is, in our opinion, less important than understanding what the output means, how to solve business problems using analytics, and how to convince the people around you to make data driven decisions and ultimately make positive changes for the organization. We use Python as a platform to demonstrate how these analyses can be conducted, but the tool you are using is certainly less important than your story. All of these techniques are possible in any modern analytics software.

2.5 Technical Note: R vs Python

We are often asked which language is better – R or Python? The answer is often ‘it depends’. As mentioned earlier, we are largely tool agnostic. For completeness, here is a brief comparison between R and Python.

From a capability point of view, R and Python are nearly evenly matched. Most analytics professionals suggest that R is easier to get started with, while Python is a more flexible general programming language. The best way to describe the difference is in usability. Python has a small number of primary packages that allow you to accomplish most tasks; those packages require a fair bit of coding to get the job done. R has a massive library of small packages that require very little code to do a specific task. Certainly, R is ahead of the game when it comes to statistics, but Python is more broadly applicable. See DataCamp’s discussion of the differences for more.

In summary, both R and Python are capable programming languages. In making a decision on which language to learn or use, we would suggest evaluating against two criteria: first, what are your colleagues using? Second, are you looking to complete analysis (R), or put a model into production (Python)? One thing to note is that if you have learned a programming language, the next one will be easier to learn.

Chapter 3 The Data Science Process

Many practitioners think of data science as primarily coding and model development. Don’t.

Data science is just one set of problem solving tools and, like other tools, has to serve a purpose. While modelling is a powerful tool for solving business problems, a model alone will not provide value. We like to think of data science as following a process from defining the problem through taking action on an insight.

  1. Define
  2. Discover
  3. Explore
  4. Analyze
  5. Communicate
  6. Operationalize

Figure 3.1: The Data Science Process

We like to visualize these steps in a circle to represent the iterative nature of a data science project. While there is a natural order to these activities, do not be afraid to go backward or forward as necessary to accomplish your goal.

3.1 Define

Are we solving the right problem?

The way you will drive value as a data scientist is not through building the best model, being a better programmer, or being effective with more tools. Data science is a very powerful problem-solving tool, but ultimately you need to be solving the right problem to drive value. Moreover, you need to convince others to act on your insights to make positive change in your organization.

An analyst once approached me boasting about the accuracy of his model in predicting customer age. He had combined data from across the bank – credit card transactions, product portfolio, mutual fund purchases. Admittedly, it was a great model. My only question for him was: “Why don’t you just use the customer’s date of birth to calculate age?”. At this stage in the data science lifecycle, we suggest you collaborate closely with your customers or clients, whether that be your business partners or your actual customers: every project has a customer of some sort. Your goal is to understand the business, define the problem and scope of work, and ultimately build a benefits and value plan.

We often will not start an analytics engagement until there is a well-defined ‘size of prize’. Don’t let the name ‘data science’ fool you. This is not about science experiments but instead about adding value.

3.2 Discover

What data do we need?

Once your business problem is well defined you can begin thinking about the data you will need. The data acquisition exercise should start with theory. Which variables do you believe will be good predictors of your outcome? How much history will you need to capture full business cycles? In general, more data is better, both in terms of number of variables (scope) and number of observations (quantity). The absence of an important variable can seriously limit the insights you can draw from your data – if you do not have, or cannot use, a customer’s recent transactions in an analysis, your insights will be limited. If you do not have a sufficiently large sample you will be restricted in the kinds of analysis you can do.

Too much data – in either scope or quantity – is almost never a problem since there is no obligation to use all of the data. You, or your model, may choose to exclude some of the variables. If your sample is large, you may split it into several subsets to be used for model development, testing, estimation, etc.
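Splitting a large sample into development and test subsets, as described above, is a one-liner with scikit-learn's train_test_split. The data here are placeholders, not one of the book's datasets.

```python
# Splitting one sample into development (train) and test subsets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(500, 2)   # 500 observations, 2 variables
y = np.arange(500)                    # one outcome per observation

# Hold back 20% of observations for testing; develop on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))      # 400 100
```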

At the end of this stage, you should have an initial dataset ready for modelling. That means you have combined all your sources, cleaned the data, and transformed it into a format useful for modelling. In our experience, this is the longest stage of an analytics project. Data requests, privacy assessments, and technology constraints take inordinate amounts of time at large organizations.

3.3 Explore

What does the data tell us?

Here, we begin crunching the data. At this stage in the project we often bring our business partners or clients back to the table to share initial insights from summary statistics and preliminary data visualizations. As mentioned earlier, your job is to solve your customer’s problem. This is a good time to take information back to them and see how it resonates – what we call the ‘gut check’. It is also an opportunity for you to get more familiar with your data and your customer’s requirements; an important precursor to analysis.

3.4 Analyze

What type of analytics will solve the problem?

Now that we have the problem defined, our data formatted correctly, and we understand our data well, we can finally get to the fun stuff – modelling! We will keep the discussion short here and go much deeper in later parts of this text. In essence, here is where you will be doing the real analysis – whether descriptive or predictive, the output from this stage should solve the business problem you identified earlier.

3.5 Communicate

What is the story?

Unfortunately, building the best model with the highest R² and most advanced technique will only impress your data science friends. It is unlikely to convince any business owner to take action. Communication and storytelling are under-appreciated skills for data scientists. In this stage, you should think about how you are going to communicate your recommendations to your customer. You should be thinking about a storyboard that follows one of the classic consulting frameworks: situation, complication, resolution. Note that in this framework you do not talk about the details of your model. Instead, those should be placed in an appendix.

3.6 Operationalize

How do we put the solution into production?

Ultimately, the point of a data science project is to make change and drive value. To do this, you are going to have to put a model into production, change an existing business process, or create a new process. Every organization handles this differently; our recommendation is to get familiar with your organization’s process early on to make sure you will be able to drive value later. You should understand whether and how models need to be validated, what software or hardware they need to run on, who the decision makers are and how to get your results in front of them.

At this stage, you should also consider end-user training, how to maintain and support your model (i.e., will it need to be updated or re-evaluated regularly?), and how you plan on continually improving your results. These will ultimately require time, money and other resources in a degree that varies by project. You should plan for the ongoing maintenance and development of your project before you start.

3.7 General Note

Data without context is meaningless. From a practitioner perspective, it can be exciting to try new packages, techniques, and languages. We encourage you to get excited by the value your data and analysis can drive, not the act of analyzing it. Your role as a data scientist will be to work closely with your business partner or owner to understand the business, their data, and ultimately to use that data to take action and drive value.

At Alex’s organization some analytics teams set aside a day a month for ‘fun’ data science projects where you do not have to have a specific business problem in mind. This is a great way to stay sharp on new tools, do innovative work, but still provide business value day to day.

Chapter 4 But First, Data!

Data is the backbone of data science. It comes in all types, shapes, sizes, and levels of quality. This chapter is meant to give you an idea of the different ways we classify and store data. First, a definition: data is any set of either quantitative or qualitative values about any set of subjects. Often, we use data and information synonymously. In fact, data does not become information until there is context; often post-analysis.

As a quick recap, data comes in four flavours, ordered by how much data they contain.

Nominal Data speaks to categories, such as being Canadian. They are often coded as 1s and 0s, but the numeric values have no actual meaning.

Ordinal Data also speaks to categories, but now adds meaning to the rank. Ordinal data is common with satisfaction surveys, which often have five levels of satisfaction in increasing order. The range between values is not meaningful; for example, a satisfaction survey may have options ‘Very Dissatisfied’, ‘Dissatisfied’, ‘Neutral’, ‘Satisfied’, and ‘Very Satisfied’ coded as 1, 2, 3, 4, and 5 respectively. These numbers reflect an order to satisfaction, not a measure of the degree of satisfaction. The difference between ‘Satisfied’ and ‘Very Satisfied’ may be much larger in quantitative terms than the difference between ‘Neutral’ and ‘Satisfied’, though the coded difference is one in each case.

Interval Data adds meaning to the distance between values, but does not have an absolute zero. Temperature is a good example of interval data, where zero degrees Celsius is very different from zero degrees Fahrenheit. With interval data it does not make sense to describe percentage improvements. For example, the move from 20 to 25 degrees Celsius is equivalent to a move from 68 to 77 degrees Fahrenheit, but the former suggests a 25% increase and the latter about a 13.2% increase.

Ratio Data contains the most information; there is an absolute zero, and the distance between values has meaning. Any salary or dollar value is ratio data. This is the most useful data for modelling! If the exchange rate is US$1 = CAD$1.30, then a US$100,000 salary with a US$10,000 raise (for a total of US$110,000) is equivalent to a CAD$130,000 salary with a CAD$13,000 raise (for a total of CAD$143,000). In both cases the percentage increase is 10%.
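The four levels of measurement can be made concrete in pandas. This is a small sketch with invented values: note how the ordinal categories carry order but not distance, and how the interval and ratio arithmetic matches the temperature and salary examples above.

```python
# The four levels of measurement, sketched in pandas (invented values).
import pandas as pd

# Nominal: categories with no order.
nationality = pd.Series(["Canadian", "American"], dtype="category")

# Ordinal: ordered categories; the distance between them is not meaningful.
satisfaction = pd.Categorical(
    ["Neutral", "Very Satisfied", "Dissatisfied"],
    categories=["Very Dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very Satisfied"],
    ordered=True)

# Interval: differences are meaningful, ratios are not (no absolute zero).
celsius = pd.Series([20.0, 25.0])
fahrenheit = celsius * 9 / 5 + 32     # 68.0 and 77.0
# 20 -> 25 C looks like +25%, while 68 -> 77 F looks like ~+13.2%.

# Ratio: absolute zero, so both differences and ratios are meaningful.
salary_cad = pd.Series([100_000, 110_000]) * 1.30   # 130,000 and 143,000
raise_pct = (salary_cad.iloc[1] - salary_cad.iloc[0]) / salary_cad.iloc[0]
print(raise_pct)                      # the 10% raise survives conversion
```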

Data can also be categorized by its relation to time.

Cross-Sectional Data has no time dimension, and is what we are used to seeing in a standard data table. Transactions (i.e., credit card or debit card transactions), customer records, or point in time financials are examples of cross-sectional data. With cross sectional data, any analysis that can be legitimately performed on the data can be performed in any sequence without changing the outcomes.

Time Series Data contains a time series component, such as stock data.

Panel Data combines the two, allowing us to track more complicated data, such as sales from multiple stores over multiple time periods.

We will not be covering time series data as part of this text, but it is worth noting that time series data frequently violate the standard assumptions behind linear modelling and can cause some truly bad outcomes.
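A quick sketch of what panel data looks like in pandas, using an invented two-store, two-month example: slicing by store recovers a time series, and slicing by month recovers a cross-section.

```python
# Panel data: multiple stores over multiple periods (invented numbers).
import pandas as pd

panel = pd.DataFrame(
    {"sales": [100, 110, 90, 95]},
    index=pd.MultiIndex.from_product(
        [["Store A", "Store B"], ["2024-01", "2024-02"]],
        names=["store", "month"]))

# Slicing by store yields a time series for that store...
print(panel.loc["Store A"])
# ...while slicing by month yields a cross-section across stores.
print(panel.xs("2024-01", level="month"))
```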

4.1 Internal vs. External Data

Internal and external data are distinguished by their source. Internal data is generated by the business and includes data points like customer information, sales, transactions, personnel records, and operations data. External data is produced outside the organization and tends to have broader reach. External data includes data points such as census information, competitor information, demographic data, and economic data.

While both types of data are important, it is useful to distinguish your data this way. External data can be expensive, but often comes pre-cleaned, formatted, and with a useful data dictionary. Internal data, while free, is often hard to work with up-front. Our experience suggests data scientists at large organizations lean too heavily toward external data and forget that internal data has one huge redeeming quality: you are the only one with access to it.

Whether your data is internal or external, all data can be classified along two further dimensions: structured vs. unstructured, and small vs. big.

4.2 Structured vs. Unstructured Data

Structured data is what you are likely most familiar with. Any data that fits neatly into a table or relational database is considered structured. Most structured data sits in rows, columns, and fixed-width, typed cells. In most databases, this structure needs to be pre-defined; for example, the database will need to know whether a column holds a number, text, or a date. Until recently, analytics focused primarily on structured data. With the advent of machine learning and deep learning we are now able to process unstructured data.

Unstructured data is everything else. Data that does not fit neatly into a table is considered unstructured. This data ranges from free-form text to pictures, audio, or video files. Some common sources of unstructured data businesses are starting to utilize include social media data, call center audio, and click-stream data from websites. It is worth noting that, for modelling purposes, even unstructured data must eventually be structured. Techniques like Natural Language Processing (NLP) are used to take unstructured text data and give it structure.

This text will focus primarily on structured data.

4.3 Small vs. Big Data

Most analytics practitioners have heard the term (and resulting hype) around Big Data. You should know that Big Data is largely a marketing term, popularized by vendors such as IBM. There are no boundaries or fixed definitions on what makes data ‘big’. Instead, we like to think of big data as a series of techniques used when data size surpasses the capabilities of traditional data processing applications.

For context, we are storing exponentially more data now than ever before. IDC suggests that the total size of all our global data is approximately 33 Zettabytes (1 sextillion bytes, or 1 billion Terabytes) as of 2018. By 2025, the IDC expects total stored data to grow to 175 Zettabytes (David Reinsel 2018). Our increasingly digital world is driving the massive growth in data. With the advent of Internet of Things (IoT) technology, our cars, kitchen appliances, and TVs are producing and collecting data.

Think of the last Uber ride you took. There is the obvious data that Uber collects; where you started, where you are planning on going, how many people, etc. However, a simple Uber trip produces a massive amount of data you may not have considered. Uber now knows what traffic is like on your route, where you work, where you live, what music you like to listen to (if you connected Spotify), and how price sensitive you are.

IBM provides four metrics to evaluate your data against to understand how ‘big’ it may be:

Volume

Volume is the total amount of data you need to store. There is little use in trying to assign a minimum volume for big data; the amount of data we produce is growing so rapidly, any metric created today would be too small next year.

Variety

While it is possible to have big structured data, most data scientists consider unstructured data to be a major piece of big data analysis. Variety speaks to the number of different types of data you have. Today, we expect most companies to have their traditional relational databases with transaction and customer information and be able to pair that with other, unstructured data such as social media, phone calls, pictures, and voiceprints.

Velocity

Velocity speaks to the speed at which data are created, recorded, and actioned. Think of how many credit card transactions are processed in a minute; a risk model to identify credit card fraud must immediately score every one of those. Now, imagine how many more tweets or text messages are sent in a minute. High velocity data is much more challenging to work with. New tools, such as Kafka, have been designed to help with this exact problem.

Veracity

Veracity refers to the trustworthiness of data. With small data you can often vet and clean data before analysis. As data gets bigger and faster you need to process it in real time, without cleaning or understanding underlying distributions. Because of this, you need to have an understanding of how trustworthy the incoming data is and whether you believe it to be inherently biased.

4.4 SQL vs. NoSQL

SQL is effectively synonymous with a relational database or relational structure. Once we leave the classic relational database behind, we will need new technologies and new languages to interact with our data.

Here is a brief discussion on the differences in database types.

Relational Database | Unstructured Database
Vertically scalable (add more hardware on a single server) | Horizontally scalable (add more servers)
Pre-defined schema | No schema
Great for complex queries | Great for simple queries
Used for high-volume transaction data | More flexible for data storage, but not as robust
Standardized across many platforms | Substantial variation in technology and language

There are a few common types of NoSQL databases. This section is primarily to provide background on the databases you may encounter in your role as a data scientist. Often, the database will have already been created and you will need to know how to interact with it.

4.4.1 Key-Value

Key-value databases have a very simple data model – each entry has a unique primary key (usually an integer like an ID, or a string like a web address) paired with a value that could contain anything (like an integer or string). The ‘value’ piece of this model is extremely flexible and does not require any consistency or schema – each value could be a different data type.

These databases are very simple, easy to scale, and effective at a large number of small read/write operations; however, they do not support complex queries such as ranges. Common uses include shopping carts and product details in e-commerce, telephone directories and IP forwarding tables in networking, and emails and user profiles for general internet storage.

Common key-value databases include Redis, Oracle NoSQL, Riak, Aerospike, Apache Ignite, and many more.
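To make the key-value model concrete, here is a minimal sketch using a plain Python dictionary. This is only an illustration of the data model – real key-value stores such as Redis add persistence, replication, and network access – and the keys and values below are invented for the example.

```python
# Illustrating the key-value data model with a plain Python dict:
# one opaque value per unique key, no schema enforced across values.
store = {}

# Writes: each value can be a completely different type.
store["user:1001"] = "Alex Scott"
store["cart:1001"] = ["milk", "eggs", "bread"]
store["visits:1001"] = 42

# Reads are simple lookups by key.
print(store["cart:1001"])  # ['milk', 'eggs', 'bread']

# Range or content-based queries are not supported directly;
# you would have to scan every key yourself.
```

Note how nothing prevents each value from having a different shape – flexibility that comes at the cost of complex query support.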

4.4.2 Document

Document databases are very similar to key-value databases, with the primary difference being that the ‘value’ is replaced with a document such as a JSON, XML, or BSON. By using a self-describing document instead of an opaque value, we allow for querying of the database based on the content of the document.

Document databases are commonly used for storing online user profile information and for real-time big data applications.

Common document databases include MongoDB and CouchDB.

4.4.3 Columnar

Also known as wide-column or column-oriented databases, columnar databases follow a column-focused data model. Each row has a row key, and you can look up any column value with that key and a column name. This takes the classic relational database and effectively creates a separate table for each column. In a relational database, a customer table might have an ID, a customer name, an address, and an age. In the equivalent columnar database, you would have three tables, each with the same ID and one of the columns.

Columnar databases are used where huge volumes, high availability, and heavy read/write activity are required. They also allow for more flexibility than a traditional relational database, as each row key can have different columns associated with it. Facebook, eBay, and Yahoo all use columnar databases.

Common columnar databases include Cassandra, Google BigTable, MonetDB, and simpleDB.
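The row-versus-column distinction can be sketched in a few lines of Python. This is purely conceptual – systems like Cassandra organize columns into column families with their own storage engines – and the sample address and age values are invented.

```python
# Row-oriented view: one record holds all columns for a customer.
row_store = {
    1001: {"name": "Alex Scott", "address": "12 Main St", "age": 34},
}

# Column-oriented view: one mapping per column, all keyed by the row key.
names = {1001: "Alex Scott"}
addresses = {1001: "12 Main St"}
ages = {1001: 34}

# A lookup needs the row key plus a column name, and a given row
# can simply be absent from a column without violating any schema.
print(names[1001])                        # Alex Scott
print(ages.get(1002, "no age recorded"))  # missing column for this row key
```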

4.4.4 Graph

Graph databases do not follow a pattern similar to the three NoSQL databases above. Instead, they are used to map out entities and relationships. Graph databases have nodes and edges, where nodes represent ‘things’ and edges represent ‘relationships’. For example, you could have a node for each customer of a bank and edges that represent addresses or phone numbers. Using a graph database, you could identify roommates or partners through a shared address, or fraudsters that use the same phone number on dozens of accounts.

Graph databases do an excellent job of showing relationships, but can become expensive to traverse as the number of nodes and edges grows. They are commonly used in social networks and fraud detection.

Common graph databases include Neo4j, OrientDB, and Blazegraph.
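The fraud example above can be sketched with nodes and edges stored in plain Python structures; a production system would instead use a graph database's own query language (e.g. Cypher in Neo4j). The customer IDs and phone numbers here are invented.

```python
# Edges link customer nodes to phone-number nodes.
edges = [
    ("cust:A", "phone:555-0001"),
    ("cust:B", "phone:555-0001"),
    ("cust:C", "phone:555-0199"),
]

# Build an adjacency map from each phone node back to its customers.
by_phone = {}
for cust, phone in edges:
    by_phone.setdefault(phone, []).append(cust)

# Phone numbers attached to more than one account may signal fraud
# (or simply a shared household).
shared = {p: custs for p, custs in by_phone.items() if len(custs) > 1}
print(shared)  # {'phone:555-0001': ['cust:A', 'cust:B']}
```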

4.5 Functional vs. Non-Functional Features

So, we have talked about at least five different types of databases, many of which serve a similar purpose. Database administrators use two sets of decision criteria to decide which database best suits a specific purpose.

4.5.1 Functional Features

Functional features speak to the database’s ability to return results. Some databases do not have the same filtering or querying abilities. We typically think of these as falling into five categories:

  1. Ranges
  2. Filters
  3. Sorting
  4. Full text search
  5. Aggregation (sum, average, etc.)

4.5.2 Non-Functional Features

There are three non-functional features worth considering: consistency, availability, and partition tolerance. The CAP theorem states that you can only ever have two of those three in a distributed database.

Consistency

A guarantee that every node in the database returns the same value for the same key. In other words, everyone has the same view of the data all the time.

Availability

The likelihood that any given request receives a response.

Partition Tolerance

The database continues to operate despite network issues between nodes.

4.6 Hadoop

A discussion on NoSQL databases would not be complete without mentioning Hadoop, a very well-known distributed storage and processing software. Hadoop is a set of software packages designed by the Apache Software Foundation that allows for distributed computing. A traditional relational database is stored on a single server. A Hadoop database is broken into many smaller pieces, called blocks, and distributed across many nodes in a cluster. The distributed structure allows for horizontal scalability (add more nodes, get more power and storage) and parallel processing (all nodes can be working on the same task at the same time, combining results later).

Hadoop can handle both structured and unstructured data and integrates well with a number of data science tools. Hadoop clusters are becoming increasingly common as the infrastructure for data lakes.

4.7 How Python thinks about Data

Each programming language has its own way of thinking about data. Python has a few frameworks and packages that make it easy to store and work with data. All data in Python is assigned to an object using the ‘=’ operator, which broadly reads as ‘is assigned the value’. So the Python statement ‘x = 7’ means that the variable ‘x’ is assigned the value 7. Objects can be interacted with using a number of Python packages; for our purposes, largely pandas, NumPy, and scikit-learn. Python has primarily three categories of objects, each with specific data types.

In Python there are also different types of numbers, such as integers (numbers without decimals) and floats (numbers with decimals); categorical data (what R calls factors) is typically handled by pandas rather than by a built-in type. This is largely taken care of behind the scenes.
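A quick sketch of what ‘behind the scenes’ means here: Python infers the numeric type from the literal you write, and division behaves accordingly.

```python
# Python infers numeric types from the values themselves.
print(type(7))    # <class 'int'>
print(type(7.0))  # <class 'float'>

# Division always produces a float; floor division stays an integer.
print(7 / 2)   # 3.5
print(7 // 2)  # 3
```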

4.7.1 Single Entries

As the name suggests, these are single strings (i.e., groups of text) or numbers. It is important to note that strings could be entire words or sentences. Most programs treat anything in quotation marks as a single string (e.g., “a” is a string, but so is “data science is great”).

a = 1
print(a)
## 1
b = "Hello, world!"
print(b)
## Hello, world!

4.7.2 Arrays

Arrays store multiple numbers or strings in a single object. You create an array using the numpy.array function, or create a dataframe using pandas.DataFrame. Dataframes and arrays are very similar, but arrays typically have better performance and are what we will use for predictive modelling. Arrays can be single-dimensional (like a vector) or multidimensional, like a matrix or dataframe.

import numpy
array = numpy.array([[1,2,3,4], [5,6,7,8]], dtype=numpy.int64)
print(array)
## [[1 2 3 4]
##  [5 6 7 8]]

4.7.3 Lists

Lists are the most flexible objects in Python and can contain any combination of the preceding objects. Python also provides two related container types: dictionaries and tuples. Dictionaries contain key-value pairs, where each ‘key’ is associated with a ‘value’, such as Minnesota:Twins or Seattle:Mariners. Tuples are immutable lists (i.e., lists that cannot be changed).

my_list = ["This is a list", 1, 3, 5]
print(my_list)
## ['This is a list', 1, 3, 5]
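Since dictionaries and tuples were described but not shown, here is a brief sketch of each, reusing the baseball example from above.

```python
# A dictionary of key-value pairs: city -> team.
teams = {"Minnesota": "Twins", "Seattle": "Mariners"}
print(teams["Minnesota"])  # Twins

# A tuple is an immutable list; trying to modify it raises a TypeError.
point = (3, 5)
print(point[0])  # 3
```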

Chapter 5 Data Management

Now that we know how to store our data technically, we should discuss the best practices behind data management. Anyone who has spent time at a large corporation attempting to do data work will appreciate the importance of good data management. Without it, data is hard to access, dirty, and generally not useful for analytics.

Data management refers to the administrative processes which ensure data is accessible, reliable and timely for its users. Many organizations have amassed large quantities of systems and tools through acquisitions and specific operational needs, which makes it difficult to effectively manage their data and develop their analytics capabilities at an enterprise level. However, in the past decade the recognition of the value of data has increased to the point that some organizations have introduced a Chief Data Officer role, recognizing the importance of data as an asset and the need for enterprise-wide formal data management practices.

We will cover the key components of data management and the approach to developing an effective organizational strategy.

5.1 Data Governance and Stewardship

Establishing an appropriate data governance structure will allow organizations to translate enterprise requirements into business-specific strategies, and to proactively identify data-related issues and make decisions at the appropriate level. Data stewardship refers to the roles within the organization’s businesses responsible for executing and implementing data management processes.

We believe that data governance should include centrally managed enterprise initiatives, such as developing data management programs, policies, standards and adherence measurement capabilities, and enterprise committees with business representatives across the organization at various levels. During the early phases of establishing data management programs, organizations should seek high executive engagement in order to drive accountability across businesses and enable discussions around strategy and issue management.

Below is an example of what a data stewardship structure might look like. Organizations should establish the responsibilities at each level, usually determined by the maturity of their data management capabilities. Cross-organizational committees should be established at each level of data stewardship to raise and discuss enterprise data issues and strategies.

Example of a Data Stewardship Structure

Figure 5.1: Example of a Data Stewardship Structure

5.2 Critical Data Elements

Many established organizations have large amounts of data that they have collected over time. Unfortunately, it is difficult and time consuming to identify every single type of data and then apply data management programs to it. A better strategy is to identify critical data elements, which refers to the data points required to meet operational, regulatory and product servicing needs within an organization. Identification of critical data elements can be approached in two ways: top-down or bottom-up.

The top-down approach entails a centralized function, typically the Office of the Chief Data Officer, identifying the categories of data which are critical to an organization’s operations – for example: customer identification information, company liquidity information reported to regulators, product performance information, etc. Then, each business would identify the relevant data points within these groups and critical data elements can be identified at an enterprise level. This approach is more conducive to applying data management programs consistently, however it runs the risk of missing some critical data elements that are not categorized in accordance with enterprise guidance.

The bottom-up approach requires each business within an organization to identify the data elements critical for their individual function. Typically, guidance is provided at an enterprise level to help businesses identify and categorize critical data elements. This approach helps to ensure all elements are captured; however, it can pose challenges later on if they are not categorized consistently.

5.3 Data Quality

Data quality refers to the condition of data elements in the context of usability to meet business needs and inform decisions. This includes attributes such as completeness, accuracy and timeliness. Organizations should establish appropriate definitions and quality acceptance thresholds for each attribute and develop guidance for businesses on applying them to their critical data elements. When applying data quality standards, organizations should also consider the variations for each data element. For example, if a company operates in multiple jurisdictions, a customer’s postal code may be numeric, alphanumeric, 6 characters, 5 characters, etc.

When establishing a data quality program, the following should be considered:

  • Relevant attributes:
    • Accuracy refers to the correctness of the information provided. This can include data from external sources (e.g. 3rd parties) or data that has moved through several systems. It can also refer to the currency of data. Following the example of the postal code, an accuracy check would ensure that the postal code provided for a customer by a 3rd party matches the internal record for the same customer, and that the internal record is the most up-to-date postal code.
    • Timeliness refers to the availability and accessibility of data when it is needed by the business. If postal code is critical to informing the organization’s marketing strategy, but it requires special access to an offline system to obtain, this would not meet the timeliness requirement.
    • Consistency refers to accuracy of data across systems. If data is stored in multiple sources, a mechanism should be in place to confirm the accuracy in each source.
    • Validity refers to a measure of conformity to a pre-defined syntax, including type or format. Postal codes may differ depending on the jurisdictions an organization operates in, and this should be considered when measuring validity.
    • Completeness refers to the comprehensiveness of a data element in relation to the dataset. If postal code was missing for a customer, this would be regarded as incomplete since it is critical in informing the organization’s marketing strategy.
  • Businesses should develop business rules for each attribute. For example, under the attribute of validity, the postal code field should be comprised of 6 alphanumeric digits for customers located in Canada, and 5 numeric digits for customers located in the US.
  • Organizations will need to develop measurement mechanisms for each of the relevant attributes. This will provide insight into the quality of critical data elements across an organization.
  • Acceptance thresholds should be established to identify major issues and inform engagement of different levels of governance (e.g. executive data stewards).
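As a sketch of how the validity rule above might be encoded, here is a hypothetical check in Python. The patterns mirror the simplified business rule stated in the bullet; real Canadian postal codes alternate letters and digits (e.g. K1A0B1), so a production rule would be stricter.

```python
import re

# Hypothetical validity rules from the text: 6 alphanumeric characters
# for Canadian customers, 5 numeric digits for US customers.
RULES = {
    "CA": re.compile(r"[A-Za-z0-9]{6}"),
    "US": re.compile(r"[0-9]{5}"),
}

def postal_code_is_valid(code, country):
    """Return True if the code conforms to the country's rule."""
    rule = RULES.get(country)
    return bool(rule and rule.fullmatch(code))

print(postal_code_is_valid("K1A0B1", "CA"))  # True
print(postal_code_is_valid("90210", "US"))   # True
print(postal_code_is_valid("90210", "CA"))   # False (only 5 characters)
```

Measurement then becomes a matter of running such rules over every record and comparing the pass rate against the acceptance threshold.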

5.4 Metadata Management

Metadata refers to the data which describes data elements. For example, the metadata surrounding postal code may include the date it was collected from a customer, the source (e.g. driver’s license or verbal provision) and which system it was transferred to. This information is important in organizing and keeping track of data elements. It is also an extension of data quality so the two programs should be considered together during execution.

From a technology perspective, metadata management is important in creating links between data elements and users across data repositories (e.g. CRM and ERP systems). It also helps create consistency in the way data is stored and shared. Common challenges include difficulty in tracking the origin of data.

Metadata attributes should be determined based on the needs of the organization. Examples of attributes include: origin, date, provider, storage system, verification method, expiration date, security classification, and others.

5.5 Master Data Management

Master data management refers to a system which enables an organization to link all critical data to one file which provides a common point of reference. This is often a challenge for large organizations if they have grown through mergers and acquisitions, or if their business lines have adopted disparate systems and master data management practices. For example, let’s consider a banking customer who opens a registered retirement account in a branch. If the bank’s marketing department does not have access to the same information as the retail branch group, they may target the customer with advertisements about registered retirement accounts when he or she already has one. This will result in a poor customer experience and an ineffective business management strategy. The marketing department will use resources on a customer who will not buy the product, and the customer will feel like the bank is disconnected and does not understand his or her needs.

Master data management programs seek to address this issue by creating one reference point for all businesses. This is done by creating a consolidated data management system which integrates all systems across an organization or creating reconciliations between management systems that cannot be integrated to ensure all data is equal. Master data management processes may include data collection, transformation and normalization, business rule administration, data classification, taxonomy and others. The extent of complexity in these systems will reflect the complexity of the organization, with the ultimate goal being the creation of one point of reference for all business lines to enable a connected experience.

5.6 Data Lineage

Data lineage refers to the mapping of data flows across an organization. It is important to identify where data originates, its transformation points, storage and usage, in order to effectively apply data management programs and to support master data management systems. Without centralized data management, organizations may face challenges identifying lineage as multiple business lines can collect the same information (for example, customer name), or it may be transformed multiple times depending on where it flows.

With changing regulation, which we will discuss in the following section, there is increasing pressure on organizations to be able to identify data lineage and provide this insight to customers. To better understand data lineage, organizations can start by asking their lines of business to identify lineage parameters for identified critical data elements. These business-specific lines can then be connected, a practice commonly referred to as stitching, to create an enterprise view of data flows and transformations. This process is more complex than it sounds, because even within one business line data can be used by multiple processes that may be disconnected. However, once complete, organizations are better able to manage their data and can be prepared to meet evolving regulatory requirements.

5.7 Privacy and Security

Data privacy refers to how organizations use the data they have about their customers. It is important to use data responsibly and reasonably, and not to discriminate or disadvantage customers through inferences drawn from data, such as through analytics or predictive modeling. There are many laws that govern how data can be used, such as the General Data Protection Regulation (GDPR) in Europe or the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada. Organizations should be aware of all laws applicable in the jurisdictions in which they operate, including where data is stored (for example, a company may operate in Europe but if they store data in California, they should comply with California privacy laws as well as GDPR).

The concept of acting reasonably with data is trickier. This generally refers to not doing anything with data that customers may perceive as invasive or “creepy”. For example, having an Amazon Alexa or Google Home in your house may have been perceived as invasive a decade ago, but today these products are found in many households and often depended on to create a smart home. However, let’s consider if a bank used customer transaction data to predict if he or she is an alcoholic in order to direct them to Alcoholics Anonymous or other social programs. Although the intention is to help the customer, it may be perceived as invasive and the customer may not understand why the bank should be involved in their personal life. This can result in non-compliance with the law and ultimately liability, in addition to severe reputational damage. The measure of reasonability will change over time; as technology advances customers are typically willing to share more data in exchange for convenience. Therefore, organizations will need to implement a method to measure customer sentiment and response to offerings before mass implementation. Methods can include population surveys or customer pilots.

5.8 Data Security

Data security refers to protection of customer data against breach. As dependency on technology increases companies become vulnerable through multiple entry points (for example, the cloud) and cybersecurity becomes a critical function. Having effective data management practices allows organizations to be aware of all data and systems that could potentially be threatened by external entrants, and to prepare accordingly. Data management groups (such as the Chief Data Office) should work closely with technology partners to understand data flows and establish best practices.

Chapter 6 Acquiring Data

Outside of organizations, data appears to be fairly easy to access. Websites such as Environics or Statistics Canada provide access to mostly clean, nicely formatted data with exactly the records you are looking for. Do not get used to clean, reliable data - it is the exception, not the rule.

Data is the heart of any data science project. You should understand how to access data in your particular organization. This is more than just a technical exercise. From a technical perspective, our experience suggests that most data can be accessed using a variant of SQL. Corporate data is largely stored in relational databases, and even when data is stored in a larger data lake, big data tools often use a form of SQL (such as Apache Hive or Impala) to access structured data. That said, policy, legal, ethical and data ownership issues may restrict the access to data even when that data is contained within the organization. Consequently, it is important to understand the scope of data requirements at the outset of a project to ensure that the data may be used for the purpose you intend. Since these issues are highly organization-specific, we focus on the technical aspects of acquiring data.

It is worth discussing the different data formats you may encounter in your data science projects. Different data formats have different characteristics and must be handled differently.

6.1 Excel

Excel may not be the best tool for any single data job, but it is a tool everyone has. Excel files are often easy to work with, though they may require special packages to be read into statistical software. Excel files are inherently well structured, though not immune to formatting errors. While Excel is a very convenient way to record and examine data, you will eventually hit computational and storage constraints with Excel files.

6.2 Comma Separated Values (CSV)

CSV files are likely the most common format for sharing and storing data for analytics. Comma separated values literally means that each value for each row is separated by a comma (we call this a delimited file, and there are other delimiter options such as spaces and semicolons). This tells software packages where values begin and end. These files are universally compatible with analytics software, but are constrained to structured data. Often, data tables from relational databases are exported as CSV files.

6.3 JavaScript Object Notation (JSON)

JSONs are lightweight, text-based and human readable data objects. They can be easily edited in any text editor, and are commonly used to transfer data between servers. The format was built to be very flexible, and can contain numbers, strings, or arrays within a self-contained schema.
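Python's built-in json module makes these documents trivial to read and write. The record below is invented for illustration.

```python
import json

# Parse a JSON document (text) into native Python objects.
text = '{"id": 1001, "name": "Alex Scott", "products": [3, 5]}'
record = json.loads(text)
print(record["name"])      # Alex Scott
print(record["products"])  # [3, 5]

# Serialize the structure back to a JSON string.
print(json.dumps(record))
```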

6.4 SAS, SPSS, and Stata

Naturally, some of the older (and more expensive) data science tools have proprietary formats. While software like SAS, SPSS, and Stata can read Excel files or CSVs, everything is output in their own format. Fortunately, there are packages available in both Python and R to support reading those formats. More on this in the next chapter when we discuss importing and cleaning data.

There are, of course, other less convenient data types, but we will save those for another book.

6.5 Database Structure

Before any database is built, a conceptual data model must be in place. Most analysts never encounter the work that goes into building an efficient, effective database. We believe that you should have some background in how databases are built to give you an intuitive understanding of how to access data from them.

6.6 Entity-Relationship (E-R) Diagrams

E-R diagrams describe exactly what the name suggests: entities (or tables) and relationships (the link between those tables). A fully built E-R diagram describes exactly how each table is related to every other table using some simple notation and language.

6.6.1 Entities

Entities are synonymous with tables. They include:

  • A table name
  • Characteristics or Attributes (columns)
  • Unique identifiers (a primary key)

6.6.2 Relationships

Relationships describe how the tables link together. Tables always link on primary and foreign keys, but the relationships also have:

  • Names
  • Cardinality (or maximum cardinality)
  • Optionality (or minimum cardinality)

To better understand, let us consider a simple example. You are working on a database with two tables: one for customers, one for orders. Intuitively, a customer could have anywhere between zero orders (minimum cardinality) and many orders (maximum cardinality). On the other hand, an order can only belong to one customer (minimum and maximum cardinality of one). In database speak:

  1. Each customer may originate zero or many orders
  2. An order must be originated by one, and only one customer

We use symbols on a diagram to denote these relationships.

Entity Relationship Diagram Symbols

Figure 6.1: Entity Relationship Diagram Symbols

With these symbols, we can draw out the above relationship:

Example ERD

Figure 6.2: Example ERD

Naturally, these diagrams get more complex as you increase the number of tables in your database.

6.7 Schemas

A database schema is another way to represent the structure and relationships between tables. It is often used by analysts as a precursor to data requests. Specifically, a database schema will list all of the tables, the relationships between tables, the primary and foreign keys to join on, and often all of the variables contained within each table. Schemas often do not show cardinality as an E-R diagram would.

6.8 Normalization

An important consideration in database structure and design is normalization. The underlying principle in normalization is to reduce duplication across tables. Normalization has a number of benefits, and one drawback.

A normalized database saves on space, but more importantly maintains data integrity by storing each field in only one location. Whenever data is inserted, updated, or deleted, all duplicated data also needs to be modified; the more duplicated data, the harder and more computationally intensive the update process. On the other hand, a normalized database requires more tables to be joined, increasing query run time since the database must perform more work while the SQL command is being executed.

6.9 Writing SQL

This section is meant to give you preliminary knowledge and some practice at writing SQL. We will cover all the basics you might need to pull data out of a database and do some basic analysis. For the purposes of this text we will be using a SQLite database. It is open source and available for free as SQLite Studio. While not as robust as a Microsoft or IBM supported tool, it certainly meets all the requirements for some at-home data science. If you are interested in writing SQL queries in R or Python, use the sqldf and sqlite3 packages respectively.
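If you prefer to follow along without installing SQLite Studio, Python's built-in sqlite3 module can run the same queries. The sketch below builds an in-memory copy of the customer table; the column names and values are taken from Table 6.1 below.

```python
import sqlite3

# Create an in-memory database and load the customer table (Table 6.1).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        cust_name TEXT,
        num_products INTEGER,
        province TEXT,
        len_relationship INTEGER,
        class TEXT
    )
""")
con.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1001, "Alex Scott", 3, "ON", 1, "wealth"),
        (1002, "Keith Rogers", 4, "ON", 1, "mass market"),
        (1004, "Donald Trump", 2, "AB", 2, "hnw"),
        (1101, "James Dean", 3, "BC", 3, "wealth"),
        (1006, "Stephen Smith", 5, "ON", 4, "hnw"),
    ],
)

# Any query from this section can now be executed with con.execute().
for row in con.execute("SELECT id, cust_name, class FROM customer"):
    print(row)
```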

We will be working with two tables, both meant to mimic data you might find at a bank. The customer table:

Table 6.1: 5 records
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth
1002 Keith Rogers 4 ON 1 mass market
1004 Donald Trump 2 AB 2 hnw
1101 James Dean 3 BC 3 wealth
1006 Stephen Smith 5 ON 4 hnw

And the transaction table:

Table 6.2: Displaying records 1 - 10
id customer_id txn_type txn_total
1 1001 Debit 165.78
2 1001 Credit 42.10
4 1001 Credit 103.03
5 1001 Credit 56.60
6 1002 Debit 214.34
7 1002 Cash 115.69
8 1002 Debit 21.37
9 1004 Cash 227.58
10 1004 Credit 5.92
11 1006 Debit 225.89

You can download the customer table here: Download customers.csv

And the transaction table here: Download transactions.csv

6.9.1 Select

The simplest SQL query form is the select statement. Using a select statement you can tell your DBMS which columns you want from their respective tables.

SELECT id, cust_name, class FROM customer
Table 6.3: 5 records
id cust_name class
1001 Alex Scott wealth
1002 Keith Rogers mass market
1004 Donald Trump hnw
1101 James Dean wealth
1006 Stephen Smith hnw

Here, we can see all the results from the id, cust_name, and class columns in the customer table. Be careful with this! Your DBMS will continue to run until all the data has been output, even if your table is a million rows. You can replace column names with a ‘*’ to mean ‘all columns’.

 SELECT * FROM customer
Table 6.4: 5 records
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth
1002 Keith Rogers 4 ON 1 mass market
1004 Donald Trump 2 AB 2 hnw
1101 James Dean 3 BC 3 wealth
1006 Stephen Smith 5 ON 4 hnw

6.9.2 Removing Duplicates

Sometimes, data is duplicated in the database and you only want to see unique records. Fortunately, this is easy to do with the distinct operator. Notice the difference between the output of these two queries:

SELECT customer_id FROM trans
Table 6.5: Displaying records 1 - 10
customer_id
1001
1001
1001
1001
1002
1002
1002
1004
1004
1006
 SELECT DISTINCT customer_id FROM trans
Table 6.6: 4 records
customer_id
1001
1002
1004
1006

6.9.3 Sorting

By default, query results are not sorted in any particular order. Sometimes, you want them to be. You can sort your output by any column in ascending (default) or descending order, and can specify tie breaking columns the same way.

 SELECT * FROM trans ORDER BY txn_total DESC, txn_type DESC
Table 6.7: Displaying records 1 - 10
id customer_id txn_type txn_total
9 1004 Cash 227.58
11 1006 Debit 225.89
6 1002 Debit 214.34
1 1001 Debit 165.78
7 1002 Cash 115.69
4 1001 Credit 103.03
5 1001 Credit 56.60
2 1001 Credit 42.10
8 1002 Debit 21.37
10 1004 Credit 5.92

This output is sorted first by the total amount, and ties are broken by the transaction type descending.

6.9.4 Filtering

Using filters we can select only the rows we want from tables. Combining both column and row filtering makes this a very powerful database feature!

SELECT * FROM customer WHERE province = 'ON'
Table 6.8: 3 records
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth
1002 Keith Rogers 4 ON 1 mass market
1006 Stephen Smith 5 ON 4 hnw

Of course, other operators are available. You can use criteria such as not equal (<> / !=), greater than (>), less than (<), and between (BETWEEN). You can also use wildcard characters to ‘search’ for entries using the LIKE operator. There are two options: use one or more underscores (‘_’) to represent single characters, or a percent sign (‘%’) to represent any number of characters.

 SELECT * FROM customer WHERE cust_name LIKE 'Alex %'
Table 6.9: 1 record
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth
 SELECT * FROM customer WHERE cust_name LIKE 'Alex _cott'
Table 6.10: 1 record
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth

The IN operator allows you to filter on members of a set. This becomes especially powerful as you can embed a secondary query here!

 SELECT * FROM customer WHERE id IN (1001, 1002, 1101)
Table 6.11: 3 records
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth
1002 Keith Rogers 4 ON 1 mass market
1101 James Dean 3 BC 3 wealth

Any of the WHERE criteria can be combined using the AND and OR operators.

 SELECT * FROM customer WHERE cust_name LIKE 'Alex %' AND class IN ('wealth', 'hnw')
Table 6.12: 1 record
id cust_name num_products province len_relationship class
1001 Alex Scott 3 ON 1 wealth

6.9.5 Aggregate Functions

SQL includes a powerful feature called aggregate functions. These allow you to find the minimum (MIN), maximum (MAX), sum (SUM), average (AVG), or count (COUNT) of any column. Of course, you can combine a filter with any of these queries to be more selective about what is aggregated.

 SELECT SUM(txn_total) FROM trans
Table 6.13: 1 record
SUM(txn_total)
1178.3
 SELECT AVG(txn_total) FROM trans
Table 6.14: 1 record
AVG(txn_total)
117.83

While powerful, aggregate functions can be tricky! If you ask for both an aggregated and a disaggregated column in the same query, the DBMS does not know what to do; some will raise an error, while others simply return an arbitrary row for the unaggregated column. The rule of thumb is to make sure all of your columns are aggregated, or to use a GROUP BY. Here's an example.

 SELECT customer_id, SUM(txn_total) FROM trans
Table 6.15: 1 record
customer_id SUM(txn_total)
1001 1178.3

Only one line here, and an arbitrary customer ID.

 SELECT customer_id, SUM(txn_total) FROM trans GROUP BY customer_id
Table 6.16: 4 records
customer_id SUM(txn_total)
1001 367.51
1002 351.40
1004 233.50
1006 225.89

Much better! Now we get the individual sums for each customer. Conceptually, GROUP BY works in three steps: first, collect the unique values of the grouping column; second, compute one sum per unique value, using only the relevant rows; third, report the grouping values alongside their sums.
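
The trans table here is small enough to rebuild by hand. As a hedged sketch using Python's built-in sqlite3 module (not the book's own database), we can reconstruct it from the output above and confirm the grouped sums:

```python
import sqlite3

# Rebuild the small trans table from the chapter's output, then re-run
# the GROUP BY query to confirm the per-customer sums.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trans (id INTEGER, customer_id INTEGER, "
    "txn_type TEXT, txn_total REAL)"
)
rows = [
    (1, 1001, "Debit", 165.78), (2, 1001, "Credit", 42.10),
    (4, 1001, "Credit", 103.03), (5, 1001, "Credit", 56.60),
    (6, 1002, "Debit", 214.34), (7, 1002, "Cash", 115.69),
    (8, 1002, "Debit", 21.37), (9, 1004, "Cash", 227.58),
    (10, 1004, "Credit", 5.92), (11, 1006, "Debit", 225.89),
]
conn.executemany("INSERT INTO trans VALUES (?, ?, ?, ?)", rows)

sums = dict(conn.execute(
    "SELECT customer_id, SUM(txn_total) FROM trans GROUP BY customer_id"
).fetchall())
print({k: round(v, 2) for k, v in sums.items()})  # matches Table 6.16
```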

6.9.6 Subselects

We should avoid using aggregate functions as a way to filter other columns. Remember that an aggregate function produces only a single record. Suppose we look for our longest-standing customer:

 SELECT id, cust_name, MAX(len_relationship) FROM customer
Table 6.17: 1 record
id cust_name MAX(len_relationship)
1006 Stephen Smith 4

The aggregate collapsed everything to one row. What if we wanted all of our newest customers, those with the shortest relationship length? Two customers tie at a length of 1, so one row cannot be the full answer. Let's try a subselect instead.

 SELECT id, cust_name, len_relationship FROM customer WHERE len_relationship = (SELECT MIN(len_relationship) FROM customer)
Table 6.18: 2 records
id cust_name len_relationship
1001 Alex Scott 1
1002 Keith Rogers 1

It may look like more work to write a subselect, but it is necessary any time you filter with an aggregate function. With the subselect, we can see both of our newest customers.

6.9.7 Joins

Up until this point we have been querying one table at a time. The heart of the relational database is being able to join tables together so we can tell the whole story. We join tables using primary and foreign keys.

 SELECT * FROM trans JOIN customer ON trans.customer_id = customer.id
Table 6.19: Displaying records 1 - 10
id customer_id txn_type txn_total id cust_name num_products province len_relationship class
1 1001 Debit 165.78 1001 Alex Scott 3 ON 1 wealth
2 1001 Credit 42.10 1001 Alex Scott 3 ON 1 wealth
4 1001 Credit 103.03 1001 Alex Scott 3 ON 1 wealth
5 1001 Credit 56.60 1001 Alex Scott 3 ON 1 wealth
6 1002 Debit 214.34 1002 Keith Rogers 4 ON 1 mass market
7 1002 Cash 115.69 1002 Keith Rogers 4 ON 1 mass market
8 1002 Debit 21.37 1002 Keith Rogers 4 ON 1 mass market
9 1004 Cash 227.58 1004 Donald Trump 2 AB 2 hnw
10 1004 Credit 5.92 1004 Donald Trump 2 AB 2 hnw
11 1006 Debit 225.89 1006 Stephen Smith 5 ON 4 hnw

Now we have all the data from both tables! We use the JOIN command to specify which table we want to add, and the ON clause to specify the primary and foreign keys. Note one change here: we must now specify which table each column comes from (trans.customer_id), because both tables could have columns with the same name.

There are four different types of joins.

Inner joins only keep rows where the primary key exists in both tables, and excludes all other records.

Left joins keep all the primary keys from the left table in the query, and fill in NULL values if there is no match in the right table.

Right joins are the inverse of left joins. The output will contain all the records from the right table, and NULL values when they don’t match the left table.

Full outer joins keep all the records from both tables, filling in NULLs on either side where there is no match.
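
The difference between an inner and a left join can be seen in miniature. This is a sketch using Python's sqlite3 module with a cut-down, hypothetical pair of tables; James Dean has no transactions, so he appears only in the left join, with a NULL (None in Python) transaction total:

```python
import sqlite3

# Cut-down tables: one customer with a transaction, one without
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER, cust_name TEXT);
INSERT INTO customer VALUES (1001, 'Alex Scott'), (1101, 'James Dean');
CREATE TABLE trans (customer_id INTEGER, txn_total REAL);
INSERT INTO trans VALUES (1001, 165.78);
""")

inner = conn.execute(
    "SELECT cust_name, txn_total FROM customer "
    "JOIN trans ON trans.customer_id = customer.id"
).fetchall()
left = conn.execute(
    "SELECT cust_name, txn_total FROM customer "
    "LEFT JOIN trans ON trans.customer_id = customer.id"
).fetchall()
print(inner)  # [('Alex Scott', 165.78)]
print(left)   # [('Alex Scott', 165.78), ('James Dean', None)]
```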

6.10 SQL General Notes

6.10.1 Query Performance

Sometimes, SQL queries just take too long to run. If you have ever worked for a large organization, you may be familiar with a standard data science practice: start your code at 5pm and pray it generates the right results when you return the next morning. The notes below describe three common ways to speed things up: indexes, de-normalization, and data replication.

6.10.2 Indexes

An index keeps track of where data is located, akin to a phone book for your table. Databases store data in blocks on hard drives, and reading data from disk is very slow. With an index, the DBMS does not need to scan the whole table for data; it can instead look up the location in a much smaller index and then read only the records it needs. Once you create an index, you do not need to do anything else; it works in the background!

There are, however, downsides. An index takes up disk space roughly proportional to the size of your table. It also needs to be updated every time the table changes, which can add substantial computational overhead to maintaining your database.

Here are some tips for when to use indexes:

  1. Large table (small tables can be read fairly quickly and will not benefit as much from an index)
  2. Primary or foreign keys (speeds up joins)
  3. Any columns frequently in WHERE clauses (speeds up filters)
  4. Fields in ORDER BY or GROUP BY clauses (speeds up sorting / grouping)

And when to avoid them:

  1. Any fields with long values (takes up disk space and slows down the index)
  2. Tables that change frequently (indexes need to be updated every time a table changes, making it computationally intensive for tables with frequent changes)

Some DBMSs constrain the number of indexes allowed per table.

To create an index, use the following code: CREATE INDEX new_index ON customer(id);
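
You can also ask the database whether an index is actually being used. A sketch with Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN (other DBMSs have similar EXPLAIN commands):

```python
import sqlite3

# Create a table and an index, then ask SQLite how it plans to run a filter
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, cust_name TEXT)")
conn.execute("CREATE INDEX new_index ON customer(id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customer WHERE id = 1001"
).fetchall()
# The 'detail' column of the plan mentions the index,
# e.g. 'SEARCH customer USING INDEX new_index (id=?)'
print(plan[0][-1])
```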

6.10.3 De-normalization

As mentioned earlier in this chapter, normalization is a best practice for database design. Sometimes, in the interest of query speed, we can denormalize tables (in other words, duplicate data across tables). Denormalization reduces the number of computationally expensive joins needed to execute queries. Beware of the extra wasted space and the data integrity issues that come with this approach!

6.10.4 Data Replication

Data replication is the most expensive option: storing identical copies of the same data on multiple servers or in multiple data centers. This lets analysts access the data from the center closest to them and distributes the workload over more hardware. It can be very expensive and complicated to implement; you need live replication in place to make sure the data is identical in every location.

Ultimately, once your data reaches a certain size, a relational database is unlikely to meet your needs. At that point, consider migrating to Hadoop or another NoSQL-style solution.

6.11 Database Management Software

A DBMS manages everything in a database: the data, the engine that allows us to access it, and the data's schema. It is our interface into the underlying database, giving us the ability to create, update, delete, and query. There are a number of commonly used DBMSs available; some have slightly different capabilities and syntax, but the general SQL structure is consistent throughout.

6.12 Technical Notes

SQL allows for the creation, deletion, and modification of data within a database. These queries are tricky to run in the embedded SQL instance used for this book, but will work in a true DBMS.

6.12.1 Creating Tables

The first step in getting data into your database is to create a table. You need to give your table a name, name each of your columns, and tell the database what type of data each column contains. Here is an example:

CREATE TABLE customers (
 id INTEGER,
 name STRING,
 dob DATE,
 email STRING,
 acct_bal REAL,
 spouse_id INTEGER
);

Almost any data can be stored in a relational database. Some of the most common data types are listed below.

Integer Any whole number between -2^31 and 2^31 - 1.
BigInt Any whole number between -2^63 and 2^63 - 1.
Real Floating-point number between roughly 1E-37 and 1E+37, with about 6 decimal digits of precision.
Date A date between January 1, 4712 BC and December 31, 9999 AD. Stores year, month, day, hour, minute, and second.
String Variable-length characters.

There are many other data types, but these should serve you well most of the time.

6.12.2 Altering Tables

It is easy to alter tables. You can add, change, or delete columns:

ALTER TABLE customer ADD product_count INTEGER;
ALTER TABLE customer ALTER COLUMN product_count BIGINT;
ALTER TABLE customer DROP COLUMN product_count;

You can also delete an entire table:

DROP TABLE customer;

Or, just the table contents:

TRUNCATE TABLE customer;

6.12.3 Constraints

You are able to constrain tables in a number of ways. By putting in the effort up front, you can minimize data integrity problems later on. We will discuss a number of ways to constrain data entry in your databases.

6.12.4 Checks

A check constraint allows you to limit the range of values allowed in a column. These constraints go at the end of your create table query.

CREATE TABLE customers (
 id INTEGER,
 name STRING,
 dob DATE,
 email STRING,
 acct_bal REAL,
 spouse_id INTEGER,
CHECK (dob >= '1900-01-01' AND dob < '2020-01-01'),
CHECK (email NOT LIKE '%@hotmail.com' AND email NOT LIKE '%@yahoo.com')
);
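
We can watch a check constraint do its job. A sketch with Python's sqlite3 module (note that SQLite compares dates as strings, which works with the ISO year-month-day format):

```python
import sqlite3

# A dob outside the checked range is rejected with an IntegrityError
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customers (
    id INTEGER,
    dob DATE,
    CHECK (dob >= '1900-01-01' AND dob < '2020-01-01')
)""")

conn.execute("INSERT INTO customers VALUES (1, '1950-06-15')")  # accepted
try:
    conn.execute("INSERT INTO customers VALUES (2, '1850-06-15')")
except sqlite3.IntegrityError:
    print("rejected: dob fails the CHECK constraint")

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # 1
```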

6.12.5 Not Null

You can also force columns to contain a value. For example, you probably want to make sure every customer has a name. Any time you fill in a web form with a required field, there is a NOT NULL constraint working in the background. Include the NOT NULL constraint after the data type in the create table query.

CREATE TABLE customers (
 id INTEGER,
 name STRING NOT NULL,
 dob DATE,
 email STRING NOT NULL,
 acct_bal REAL,
 spouse_id INTEGER
);

6.12.6 Default Values

If a field is not mandatory but left blank, the DBMS can fill in a default value for you. Be careful with this, as the DBMS will not know when the default value is wrong. Include the default value after the data type in the create table query.

CREATE TABLE customers (
 id INTEGER,
 name STRING,
 dob DATE,
 email STRING,
 acct_bal REAL DEFAULT 0,
 spouse_id INTEGER DEFAULT 0
);

6.12.7 Unique

You can also ensure that every value in a given column is unique. Add the UNIQUE constraint at the bottom of the create table query.

CREATE TABLE customers (
 id INTEGER,
 name STRING,
 dob DATE,
 email STRING,
 acct_bal REAL,
 spouse_id INTEGER,
UNIQUE (name)
);

6.12.8 Primary and Foreign Keys

As discussed earlier, tables are joined on their primary and foreign keys. You can tell your DBMS which columns are primary or foreign. The primary key constraint also ensures that the column is NOT NULL and UNIQUE. Include the primary key and foreign key constraints at the end of your create table query.

CREATE TABLE customers (
 id INTEGER,
 name STRING,
 dob DATE,
 email STRING,
 acct_bal REAL,
 spouse_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (spouse_id) REFERENCES customers(id)
);
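
Foreign keys can be enforced, too. A sketch with Python's sqlite3 module (SQLite requires opting in to enforcement with a PRAGMA; most other DBMSs enforce foreign keys by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific opt-in

# spouse_id must reference the id of an existing customer
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    spouse_id INTEGER,
    FOREIGN KEY (spouse_id) REFERENCES customers(id)
);
INSERT INTO customers VALUES (1, 'Alex Scott', NULL);
INSERT INTO customers VALUES (2, 'Pat Scott', 1);
""")

try:
    # 999 is not an existing customer id, so this insert fails
    conn.execute("INSERT INTO customers VALUES (3, 'Sam Roe', 999)")
except sqlite3.IntegrityError:
    print("rejected: spouse_id must reference an existing customer")
```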

6.12.9 Inserting, Deleting, and Modifying Data

Naturally, once your tables have been created you will want to add data, and later you will need to append or change records. To add new records to a table, use the INSERT INTO command, providing a value for each column. The DBMS already knows the type of each column, so you can put every value in single quotes.

INSERT INTO customer VALUES ('00001', 'Alex Scott', '1950-01-01', 'alex@as.com', '10.00', '00011');

Deleting data is also easy using the DELETE FROM command.

DELETE FROM customer WHERE id=00001;

You can change records with the UPDATE command.

UPDATE customer SET name = 'Alexander Scott' WHERE id = 00001;

6.13 Practice Question

Using only SQL, solve the mystery below. It can be completed with a single (advanced) query, but can also be solved piece by piece. For additional practice, the same question can be completed using Python packages as well.

The database (for use in a DBMS) and individual CSV files are available here:

Download a_secret_message.db
Download clues.csv
Download flights.csv
Download passenger_clues.csv
Download passengers.csv

6.13.1 The Mystery

Archeologists have recently discovered a tomb outside of Hiratsuka, Japan. When they entered the tomb, they noticed that the mummy and all the valuables were missing; all that was left was a bad joke: “What kind of music do mummies like best? Wrap”. Since this type of discovery is rare and the valuables were gone, they opened the tomb to tourists while they kept exploring. After the first few months of tourists had come through, the archeologists realized they had missed a written note. Most of the text was intact, but the bottom was ripped off. The note reads:

‘I, Miyu of Hiratsuka, lie here in the beyond. I have elected this location because of its proximity to the great city, and I bid whoever finds this tomb first to deliver a special message to my beloved (three tombs down and on the right)…then to take the wealth and spread it among the people. If the special message is not delivered, I shall curse the lands for all eternity…’

The archeologists fear the worst: the special message was never delivered. They have engaged you to help identify the missing words. The local tour guide informs you that a number of tour groups have come through the tomb, and that he gives each group a souvenir. The first group got the note he discovered in the tomb and ripped up; the rest got something from the gift shop. He gave each tourist a piece of the special message, one letter per person, after organizing the tourists in alphabetical order. He does not recall what the message said, however, because he did not think it was important (what can you expect – we pay our guides minimum wage).

He does remember that all of these tourists left on the same date – they were the very first group to arrive and leave – and that they were spread amongst flights to London, Tokyo, and Shanghai.

Use the available database of passengers, flights, and clues to determine what the message said.

When you’re ready to see the solution, click the button below.

Chapter 7 Data Cleaning and Exploration

Before any modelling can begin, you need to have clean data in a usable form, and you need to understand it. Unfortunately, this is the longest and most arduous part of the data science process. Many of the source systems at large institutions were constructed long before analytics was commonplace. This means the data you receive from corporations or clients will almost certainly contain errors and omissions, and will almost never come in a form useful for modelling.

We think of the data cleaning / exploration process as having 8 steps. We will discuss all of them in this chapter, but leave some of the technical solutions for later chapters.

  1. Variable Identification
  2. Data Structuring
  3. Univariate Analysis
  4. Bivariate Analysis
  5. Missing Value analysis
  6. Outlier Analysis
  7. Variable Transformation
  8. Feature Engineering

7.1 Variable Identification

While it sounds obvious, variable identification can actually be fairly time consuming, especially if you are new to the data. In this stage, you will focus on understanding what is actually meant by each variable in your dataset. Often, the names provided will be largely meaningless, full of acronyms, short forms, and sometimes misleading or even incorrect text.

You will be relying on metadata or, in its absence, data owners and experts to help you understand what each variable means, how it was measured, where it was sourced, and whether it was transformed before being recorded.

If metadata is not available, we recommend building a data dictionary. It will serve you well throughout the project, but more importantly, anyone else who wants to use the same data later (perhaps a client or team member) will save days' worth of time in this step.

7.2 Data Structuring

Data will rarely come in the form needed for modelling. In this stage, you will transform the data into a structure easily understood by statistical software. Put very simply, you need to make sure that each column is a single variable. If your target variable is split across columns, you will not be able to model it.

We call well structured data ‘tidy’. There are three characteristics of tidy data:

  1. Each variable is its own column
  2. Each observation is its own row
  3. Each value has its own cell

We like to add a fourth: if there is more than one table, you should have key columns to join them on. Once your data is tidy, it will be easy to manipulate, model, and visualize.
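
As a small illustration of tidying (with made-up data, not the book's dataset), pandas' melt gathers repeated measurement columns into a single variable column:

```python
import pandas as pd

# Untidy: yearly sales are spread across two columns (one variable, two columns)
wide = pd.DataFrame({
    "store": ["A", "B"],
    "sales_2020": [100, 150],
    "sales_2021": [120, 160],
})

# Tidy: each variable (store, year, sales) is its own column
tidy = wide.melt(id_vars="store", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].str.replace("sales_", "", regex=False)
print(tidy)
```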

7.3 Univariate Analysis

Once your data is structured you can begin examining it. Here, you will use summary statistics and visualization to better understand the distribution of each individual variable. As you examine each variable, it is worth keeping a few key questions in mind:

  • Does the range make sense?
  • Does the distribution make sense?
  • Should there be negative values?
  • Is the data coded correctly?

To answer these questions, we use summary statistics such as the maximum, minimum, range, average, median, mode, and standard deviation. Additionally, we can visualize the variables: for distributions, our preference is the histogram, and for time series we recommend line charts. Here, we produce a histogram of sepal length from the Iris data set.

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
data = load_iris().data
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
dataset = pd.DataFrame(data,columns=names)
fig = plt.hist(dataset["sepal-length"])
plt.ylabel("Count")
plt.xlabel("Sepal Length")
plt.show()

7.4 Bivariate Analysis

Bivariate analysis is about understanding your variables in pairs. Certain characteristics, such as correlation, cannot be determined with only single variables. The most common approach to bivariate analysis is producing a scatter plot. In fact, most statistical tools have the ability to produce a single visualization containing scatter plots for every pair of variables as well as the correlation coefficient. We often examine those plots as part of this step in data exploration.

The steps below will be discussed in more detail in other chapters. We provide definitions here for completeness.

plt.scatter(dataset["sepal-length"],dataset["sepal-width"],s=2)
plt.ylabel("Sepal Width")
plt.xlabel("Sepal Length")
plt.show()

7.5 Missing Value Analysis

As the name suggests, here we must understand what data is missing, and the pattern of that missing data. There are techniques for dealing with certain kinds of missing data that should be considered before modelling.
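
A quick way to see the amount of missingness is pandas' isna. A sketch on a hypothetical frame (the book's grocery data works the same way):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns
df = pd.DataFrame({
    "income": [50000, np.nan, 62000, np.nan],
    "family_size": [2, 3, np.nan, 4],
})

print(df.isna().sum())   # missing values per column: income 2, family_size 1
print(df.isna().mean())  # share of each column that is missing
```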

7.6 Outlier Analysis

Here, we focus on identifying outliers in our data. We can split outliers into two categories – univariate and multivariate. Univariate outliers are ones where a single dimension is very large or very small relative to the distribution. So in a distribution of income, someone who earns $20,000,000 per year is an outlier. Multivariate outliers are ones that are typical in each dimension individually but unusual in the interaction of two or more dimensions. For example, it is not unusual to be a five year old in Canada, and it is not unusual to earn $100,000 per year in Canada, but it is unusual to be a five year old in Canada who earns $100,000 per year.

Univariate outliers are fairly easy to find, and are often measured as anything greater than three standard deviations from the mean. Multivariate outliers are much harder to find, but fortunately are identifiable via regression plots. Outliers are important in analysis because they can have disproportionate leverage and influence on the outcome.

Leverage is the potential of a specific observation to impact an analysis; generally, outliers have higher leverage. Influence is the actual impact of an observation on the outcome of an analysis. If an outlier fits the prediction of a model very well, it will have little influence even with high leverage. On the other hand, if it does not fit the model well, it can end up significantly influencing the model, resulting in a situation one of us describes as ‘the tail wagging the dog.’

Because of their leverage, outliers need to be considered carefully. Once you have identified an outlier, you must decide whether to keep it in your data or remove it. The decision comes down to two questions: is it an error, and is it in scope? If it is an error, delete it or fix it (see missing observations). If it is not an error but is in scope for your analysis, retain it. If it is neither, delete it. Either way, you will need to state your assumptions.
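
The three-standard-deviation rule mentioned above is easy to apply with a z-score. A sketch on a hypothetical income column (note that in very small samples the extreme value inflates the standard deviation, so the rule works best with a reasonable number of observations):

```python
import pandas as pd

# Twenty ordinary incomes plus one extreme value
income = pd.Series([45_000 + 500 * i for i in range(20)] + [20_000_000])

# Standardize, then flag anything more than three standard deviations out
z = (income - income.mean()) / income.std()
outliers = income[z.abs() > 3]
print(outliers)  # only the 20,000,000 observation is flagged
```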

7.7 Variable Transformation

Variable transformation is used to modify certain variables to capture non-linear patterns or correct for violations of modelling assumptions. The most common variable transformations are the natural log (ln) and squaring a variable.
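
A sketch of both transformations on a hypothetical skewed column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000, 45_000, 60_000, 250_000]})

# The natural log compresses the right tail, pulling in large values
df["log_income"] = np.log(df["income"])

# Squaring lets a linear model capture an accelerating relationship
df["income_sq"] = df["income"] ** 2
```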

7.8 Feature Engineering

The last stage in data exploration has more to do with improving our data set instead of getting to know it better. Here, we spend time adding new variables to the dataset, either by combining current variables or by joining external data that may be useful to our analysis.

For example, it may be the case that the number of car accidents increase when it is snowing, but only during work days. The theory may be that people still drive to work when it is snowing but avoid driving in snowy weather on other days. If that is the case, the data on weather conditions would have to be somehow linked with the data on work days (including statutory holidays) to create a variable ‘Bad_Weather_Work_Day’ = {0 if non-work day OR not bad weather day; 1 otherwise}.
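
A sketch of that flag in pandas, with made-up dates and weather (the column names here are hypothetical):

```python
import pandas as pd

days = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-07"]),
    "bad_weather": [True, True, True],
    "holiday": [True, False, False],  # Jan 2 is a statutory holiday here
})

# Work day = a weekday (Monday=0 .. Friday=4) that is not a holiday
work_day = (days["date"].dt.dayofweek < 5) & ~days["holiday"]
days["Bad_Weather_Work_Day"] = (work_day & days["bad_weather"]).astype(int)
print(days["Bad_Weather_Work_Day"].tolist())  # [0, 1, 0]
```

Only January 3 is flagged: January 2 is a holiday and January 7 falls on a Saturday.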

7.9 Batch Data Cleaning

Fortunately, you will not always have to do every step manually. As analytics becomes more commonplace, software vendors are beginning to offer automated tools to clean and explore your data. Many of these tools use machine learning algorithms and a visual interface to allow even non-technical people to clean, tidy, join, and organize data. Two popular products are Trifacta and Talend.

7.9.1 Trifacta

Trifacta is data wrangling software designed specifically for self-serve data preparation. It allows users to explore, transform, and clean data without writing any code. The enterprise version lets you save ETL ‘recipes’ that can be replicated and shared across the business. This is useful for two reasons: first, everyone can easily get to the same single dataset, and second, as new data arrives you can simply re-run the recipe to work with an updated dataset. Trifacta has also started adding AI-powered tools that suggest data cleaning steps.

7.9.2 Talend

Talend is primarily designed to work with large data lakes and has built in support for Hadoop and HDFS. The primary use case is in data integration. Talend uses its own AI to try and automate the joining and integration of data across your lake. Like Trifacta, there is no code required.

7.10 Importing data

Before we can begin exploring our data, we need to import it. Data comes in a number of different formats, each of which are imported differently. We will discuss a number of those methods here. In Python, we will use the pandas package to import data.

CSVs

Much of the data you will work with as a practitioner will come in a CSV file; effectively all data from a database will be stored this way. Use the pandas.read_csv command. We will be using the Grocery Store Data in CSV form for this section.

grocery_data = pd.read_csv("data/grocery_data.csv", header='infer', index_col='Obs')
grocery_data.head()
##     Grocery_Bill Family_Income    ...     Family_Pet  N_Adults
## Obs                               ...                         
## 1       $357.73      $142,141     ...              1         2
## 2       $276.84      $145,916     ...              0         2
## 3       $197.92       $86,185     ...              0         1
## 4       $315.75      $145,998     ...              0         2
## 5       $202.89       $79,341     ...              0         1
## 
## [5 rows x 9 columns]

There are four arguments you should get familiar with:

filepath_or_buffer – provide the file path here.

header – the row number of the headers. The default is 'infer', which tries to identify the header row automatically.

index_col – the column used to set the row labels of the data frame.

sep – the value separating columns. By default, a comma, but spaces and semicolons are also common.

Hint: if the default value is correct, you do not need to include the argument at all!

Excel Files

While Excel is rarely the best tool for analytics work, it is a tool that everyone has and can use. To import Excel files, use the pandas.read_excel command.

grocery_data_excel = pd.read_excel("<filepath>")

The read_excel command has a few more arguments than read_csv:

io – provide the file path here.

sheet_name – either a string (sheet name) or an integer (sheet position). By default, the first sheet is selected (integer 0).

header – the row number to use for the column names. By default, 0 (the first row). Use None if the data has no header row.

skiprows – number of rows to skip before importing data. By default, none are skipped. This is useful for files with multiple rows of text headers.

SAS/STATA/SPSS

Pandas still has you covered for importing data from other statistical software. We will not cover them in-depth, however the commands are:

SAS – pandas.read_sas()

SPSS – pandas.read_spss()

STATA – pandas.read_stata()

7.11 Summary Statistics

Python has built in commands to help you get to know your data. We tend to run these three commands on every new dataset to get an early look at the characteristics of the data.

Data Types

The dtypes attribute shows the data type of each column in your dataframe.

grocery_data.dtypes
## Grocery_Bill         object
## Family_Income        object
## Family_Size           int64
## N_Vehicles            int64
## Distance_to_Store     int64
## Vegetarian            int64
## N_Children            int64
## Family_Pet            int64
## N_Adults              int64
## dtype: object

Describe

The describe() command shows summary statistics (count, mean, standard deviation, min, max, and quartiles) for each numeric variable in your dataframe.

grocery_data.describe()
##        Family_Size   N_Vehicles     ...        Family_Pet     N_Adults
## count  1000.000000  1000.000000     ...       1000.000000  1000.000000
## mean      2.135000     1.902000     ...          0.159000     1.521000
## std       1.357431     0.901781     ...          0.365859     0.577835
## min       1.000000     0.000000     ...          0.000000     1.000000
## 25%       1.000000     1.000000     ...          0.000000     1.000000
## 50%       2.000000     2.000000     ...          0.000000     1.000000
## 75%       3.000000     3.000000     ...          0.000000     2.000000
## max       6.000000     3.000000     ...          1.000000     3.000000
## 
## [8 rows x 7 columns]

7.12 Initial Data Cleaning

Observe the structure of the grocery store data above. Python will try to guess the correct type of data being imported, but sometimes gets it wrong. In this case, both the family income and grocery bill variable appear to be strings instead of numbers – not good!

We will need to clean these variables before we can use them for modelling. In this case, we need to remove both the dollar sign and the commas so Python knows each value is a number, and then tell Python explicitly that it is a float.

def remove_dollar_sign(s):
    return s.replace("$","")
    
grocery_data['Grocery_Bill'] = grocery_data['Grocery_Bill'].map(remove_dollar_sign)
grocery_data['Grocery_Bill'] = grocery_data['Grocery_Bill'].astype('float64')

def remove_comma(s):
    return s.replace(",","")

grocery_data['Family_Income'] = grocery_data['Family_Income'].map(remove_dollar_sign)
grocery_data['Family_Income'] = grocery_data['Family_Income'].map(remove_comma)
grocery_data['Family_Income'] = grocery_data['Family_Income'].astype('float64')

grocery_data.dtypes
## Grocery_Bill         float64
## Family_Income        float64
## Family_Size            int64
## N_Vehicles             int64
## Distance_to_Store      int64
## Vegetarian             int64
## N_Children             int64
## Family_Pet             int64
## N_Adults               int64
## dtype: object

The replace method is very similar to find and replace in Excel. You specify the characters you want to find (in this case, the dollar sign), and then what to replace them with (in this case, nothing). The astype method takes a column and tries to convert it to whatever type you specify (in our case, float64, but other data types are possible!).

7.13 Exploring Data

Python has a number of tools to help us explore data. We used pandas to import data, but it also has a number of useful tools for filtering, sorting, and cleaning our data. We will also use NumPy and Matplotlib to help us clean and visualize our data.

7.13.1 Exploring with Pandas

Between Python and pandas we have many of the same data manipulation options we explored earlier in SQL. With them, we can sort, slice, filter, and select data however we’d like.

Let’s select a few columns from our grocery data set.

selected_grocery_data = grocery_data[["Grocery_Bill", "Family_Income", "Family_Size"]]
selected_grocery_data.head(5)
##      Grocery_Bill  Family_Income  Family_Size
## Obs                                          
## 1          357.73       142141.0            4
## 2          276.84       145916.0            2
## 3          197.92        86185.0            1
## 4          315.75       145998.0            3
## 5          202.89        79341.0            1

Here we have selected only three columns from the grocery data: grocery bill, family income, and family size.

To identify only certain rows, we can filter the data on certain conditions.

filtered_grocery_data = grocery_data[(grocery_data["Family_Income"] > 100000)]
filtered_grocery_data.head(5)
##      Grocery_Bill  Family_Income    ...     Family_Pet  N_Adults
## Obs                                 ...                         
## 1          357.73       142141.0    ...              1         2
## 2          276.84       145916.0    ...              0         2
## 4          315.75       145998.0    ...              0         2
## 13         210.36       133674.0    ...              0         2
## 14         328.55       140138.0    ...              0         2
## 
## [5 rows x 9 columns]

Later, we will discuss the theory behind feature engineering. The assign function allows you to create new variables by transforming old ones. For example, we can add a column for the fraction of income a family spends on groceries.

grocery_data = grocery_data.assign(Percentage_Income = grocery_data['Grocery_Bill'] 
    / grocery_data['Family_Income'])
grocery_data.head(5)
##      Grocery_Bill  Family_Income        ...          N_Adults  Percentage_Income
## Obs                                     ...                                     
## 1          357.73       142141.0        ...                 2           0.002517
## 2          276.84       145916.0        ...                 2           0.001897
## 3          197.92        86185.0        ...                 1           0.002296
## 4          315.75       145998.0        ...                 2           0.002163
## 5          202.89        79341.0        ...                 1           0.002557
## 
## [5 rows x 10 columns]

We will also discuss a train and test model building approach. To split the data into train and test sets, or simply to draw a sample, Pandas provides the sample method.

my_test_frame = grocery_data.sample(frac = 0.3)
my_test_frame.shape[0]/grocery_data.shape[0]
## 0.3
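The command above samples 30% of the rows for a test set. A minimal sketch (on an invented stand-in frame, since the exact rows depend on your data) of building the complementary train set by dropping the sampled index:

```python
import pandas as pd

# Invented stand-in for the grocery data, purely for illustration
grocery_data = pd.DataFrame({"Grocery_Bill": range(100)})

# 30% test sample; random_state makes the split reproducible
my_test_frame = grocery_data.sample(frac=0.3, random_state=42)

# Train set = every row not drawn into the test sample
my_train_frame = grocery_data.drop(my_test_frame.index)

print(my_test_frame.shape[0], my_train_frame.shape[0])  # 30 70
```

Because the train set is built by dropping the sampled index, no row can appear in both sets.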

7.13.2 Matplotlib

Matplotlib, Python’s most popular data visualization tool, is incredibly flexible. While perhaps not as polished as Tableau, it is just as capable, allowing for multiple types of visualizations as well as embedded analysis. We will go over some examples of how to produce charts here and then discuss the theory behind good data visualizations in the next chapter. The syntax changes very little from one plot type to the next. Here are a few examples.

Scatterplots

import matplotlib.pyplot as plt

plt.scatter(grocery_data['Grocery_Bill'],
    grocery_data['Family_Income'],s=2,c='black')

plt.show()

You can map additional variables onto a scatterplot, either by size, colour, or shape.

fig = plt.figure()
scatter = plt.scatter(
    grocery_data['Grocery_Bill'],grocery_data['Family_Income'],
    s=1,c=grocery_data['Family_Size'])
fig.suptitle('Grocery Bill vs. Family Income')
plt.ylabel("Family Income")
plt.xlabel("Grocery Bill")
cbar = plt.colorbar(scatter)
cbar.ax.set_ylabel('Family Size')
plt.show()

Histograms

hist = plt.hist(grocery_data['Family_Income'])
plt.show()

Bar Charts

bar_data = grocery_data.groupby('N_Children', as_index=False).agg({"Grocery_Bill": "sum"})
bar = plt.bar(bar_data['N_Children'],bar_data['Grocery_Bill'])
plt.show()

Note that you can change axis labels using the plt.ylabel and plt.xlabel functions.

Line Charts

There is no time series data in the grocery data we are working with, but you can create a line chart with the plt.plot command.
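As a sketch, using an invented daily price series (since the grocery data has no time dimension), plt.plot draws a line chart:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented daily series, purely for illustration
dates = pd.date_range("2018-01-01", periods=30, freq="d")
prices = 50 + np.cumsum(np.random.normal(size=30))

line, = plt.plot(dates, prices)
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
```

As with the other chart types, plt.xlabel and plt.ylabel set the axis labels.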

7.13.3 Tidying Data

Pandas has two functions to help you structure your data into a ‘tidy’ form: pivot and melt. The grocery data we have been working with is already tidy, so we are going to generate some untidy data to work with.

import numpy as np
import pandas as pd

data_size = 5

dti = pd.date_range('2018-01-01',periods=data_size, freq = 'd')

#Generate 3 NumPy arrays of normally distributed numbers
#S1: mean 50, std dev 10; S2: mean 10, std dev 0.5; S3: mean 1, std dev 1
S1 = np.random.normal(size=data_size,loc=50,scale=10)
S2 = np.random.normal(size=data_size,loc=10,scale=0.5)
S3 = np.random.normal(size=data_size,loc=1,scale=1)

stocks_data = pd.DataFrame({'Date':dti,'Stock 1': S1,
              'Stock 2': S2,'Stock 3': S3})
stocks_data
##         Date    Stock 1    Stock 2   Stock 3
## 0 2018-01-01  56.891128   8.803648  1.464616
## 1 2018-01-02  64.763325  10.001238  1.547458
## 2 2018-01-03  62.856542  10.395576  0.422171
## 3 2018-01-04  71.334175   9.314623  1.155873
## 4 2018-01-05  62.371604   9.903618 -1.375594

Note the columns in the stocks data set. We have a single variable – stock price – distributed across three different columns. Conducting analysis on data structured this way is very challenging. If we consider how a matplotlib command is structured, it becomes clear how awkward it would be to include all three stocks on the same chart. The same problems arise during predictive modelling.

To fix the structure of the data, we need to combine the stock prices into a single column and create a new column to act as a key. We can use the pandas.melt function for this.

long = pd.melt(stocks_data,id_vars='Date')
long.head()
##         Date variable      value
## 0 2018-01-01  Stock 1  56.891128
## 1 2018-01-02  Stock 1  64.763325
## 2 2018-01-03  Stock 1  62.856542
## 3 2018-01-04  Stock 1  71.334175
## 4 2018-01-05  Stock 1  62.371604

Now we have a single price column and a key column identifying each stock – much better. You can reverse this transformation with the opposite command: pandas.pivot.
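To see the full round trip, here is a small sketch (with invented values) that melts a wide table and then pivots it back:

```python
import pandas as pd

# Invented wide table: one column per stock
wide = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=3, freq="d"),
    "Stock 1": [56.9, 64.8, 62.9],
    "Stock 2": [8.8, 10.0, 10.4],
})

# Wide -> long: one row per (Date, stock) pair
long = pd.melt(wide, id_vars="Date")

# Long -> wide again: Date becomes the row index, one column per stock
wide_again = long.pivot(index="Date", columns="variable", values="value")

print(wide_again.shape)  # (3, 2)
```

The pivot call needs three pieces of information: which column identifies the rows (index), which column supplies the new column names (columns), and which column holds the numbers (values).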

7.13.4 String Manipulation

Sometimes, exported data will have extra or missing leading zeros. For example, an old system designed to store customer information may create IDs with a length of 5. When the company grows and reaches 100,000 customers, it will have to increase the length of the ID. Often, this means adding a leading 0 to the start of the old customer IDs.

The ljust and rjust functions pad a string to a given width with extra characters (ljust adds them to the end, rjust to the start), while the strip commands remove whitespace.

string = "    hello   "
print("Before:",string,"<end>")
## Before:     hello    <end>
print("After:",string.strip(),"<end>")
## After: hello <end>
print("Before:", string,"<end>")
## Before:     hello    <end>
print("lstrip:", string.lstrip(),"<end>")
## lstrip: hello    <end>
print("rstrip:", string.rstrip(),"<end>")
## rstrip:     hello <end>
print(string.strip().ljust(10,'0'))
## hello00000
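Returning to the customer ID scenario above, here is a quick sketch (the IDs are invented) of zero-padding with rjust and its shortcut, zfill:

```python
# Invented old customer IDs stored as strings
old_ids = ["1", "4203", "99999"]

# Pad each ID on the left with zeros to the new length of 6
new_ids = [cid.rjust(6, "0") for cid in old_ids]
print(new_ids)  # ['000001', '004203', '099999']

# zfill is a shortcut specifically for zero-padding
assert "4203".zfill(6) == "004203"
```

Note that padding only works on strings; if the IDs were imported as integers, the leading zeros are already gone and you would convert with str first.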

Chapter 8 Data Visualization, Storytelling, and Dashboards

A largely underappreciated aspect of data science is your ability to tell stories. We have observed solutions that are never implemented despite exceptional analytics work. If you cannot get an executive sponsor to believe in your solution, it will not matter how good your analysis is.

Data visualization is the art of turning data into graphs, plots, and other graphics. Effective visualization can aid a story, reveal insights about your data, and help you explore particularly large data where summary statistics alone are not enough. Remember that many important decision makers are not data scientists and will respond better to a story.

Edward Tufte is an American statistician who became famous for his work in data visualization. In 1983, he wrote The Visual Display of Quantitative Information, which is still used today as a guidebook for excellent visual design. Tufte suggests that “Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency. Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space. Graphical excellence is nearly always multivariate. And graphical excellence requires telling the truth about the data. Graphical displays should:

  • Show the data
  • Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
  • Avoid distorting what the data has to say
  • Present many numbers in a small space
  • Make large data sets coherent
  • Encourage the eye to compare different pieces of data
  • Reveal the data at several levels of detail, from a broad overview to the fine structure
  • Serve a reasonably clear purpose: description, exploration, tabulation, or decoration
  • Be closely integrated with the statistical and verbal descriptions of a dataset”

Tufte also speaks about the concept of graphical integrity: whether or not a graphic truly represents the underlying data. It is much easier to create confusing, misleading, or opaque charts than it is excellent ones. There are a few principles to keep in mind while making graphics to ensure you maintain integrity:

  • Remember the ‘lie factor’ – the difference between the size of the effect on the graph vs. the size of the effect in the data. This becomes particularly important if you change the axis size.
  • Never quote data out of context
  • Visualize the data with regard to its underlying distribution (do not distort axes)
  • Label events and other important explanations to remove ambiguity

Even the best of us make mistakes when it comes to data visualization. A favourite newspaper of ours, The Economist, recently published a blog post titled “Mistakes, We’ve Drawn a Few”. It catalogs some of The Economist’s worst visualizations, why they were bad, and how the authors would fix them.

8.1 The Types of Visualizations

We think of data visualizations as falling into four categories:

Comparison

Comparisons always consider two or more variables and the differences between them. Good comparisons allow you to easily identify maximums and minimums and tie them back to categories. For comparisons between categories, we recommend you stick to bar charts. For comparisons over time, you can use line charts, bar charts, or radar charts.

For example, one might compare year over year performance with a clustered bar graph or a line chart with a line for each year.

Distribution

Distribution speaks to how data is spread out or grouped, in much the same way you thought about normal or binomial distributions. Distribution charts are normally relatively simple. For single variables, you should think about using histograms. For two variables, think about scatterplots or heat maps.

For example, you may want to understand the distribution of customer profitability with a histogram.

Composition

Compositions show how individual parts make up a whole. For example, you might use a chart to visualize how all of your products contribute to a total revenue number. Unfortunately, most people lean on pie charts for this particular type of visualization – please don’t. Humans are notoriously bad at evaluating the relative sizes of circles and interpreting angles (if you do not believe us: which is longer, the circumference or the height of a can of Diet Coke?).

For example, you could break out revenue in a waterfall chart by each of its individual components or products.

Relationship

Relationship charts show the connection or correlation between variables, helping to identify correlations, trends, patterns, and clusters. Typically, we use scatterplots for two variables and bubble charts for three variables. For example, you may want to understand how supply affects demand, or how interest rates affect profit, with scatterplots.

As you think about creating your own visualizations, here are a few of our tips:

  1. Clearly label your charts, axis, and data points
  2. Stick to one idea per slide
  3. Start with summary graphics, and move on to more complicated insights
  4. If you have to explain the graphics in person, they are too complicated
  5. Avoid chart junk – unnecessary graphics, icons, etc. that don’t add to the story or the visual
  6. Make sure your clients understand the underlying data before you show them any analysis

Now that you have some background on data visualization, we should talk about how to expand that knowledge into storytelling and creating dashboards.

For some examples of the best data visualizations out there, this is a great top 10: https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples

8.2 Storytelling

A single data visualization is one thing; a series of data visualizations with the right headers and story is powerful. To the best of our knowledge, the big consulting firms all use basically the same structure for their PowerPoint presentations (from here out: decks), based on Barbara Minto’s Pyramid Principle. Put simply, the pyramid principle suggests that all presentations should follow a pyramid model:

  1. Executive summary up front, often with a recommendation and three arguments
  2. Each recommendation and the supporting facts in the middle
  3. Supporting pages and data at the end

The pyramid principle is a top-down method of storytelling and puts the recommendation up front. The pyramid principle has been successful because it helps you get to the conclusion very quickly, which is useful in front of an executive audience. Executives tend to be busy and already know where they want to focus. If you are able to put the recommendations up front, have a powerful storyline, and well-structured supporting information, executives will be able to understand your message and focus directly on their interest. You are likely to have a successful boardroom meeting.

It will be challenging as a data scientist to use this method. You will have spent weeks or months on your projects, getting deep into the problem, the business, and the technical details. You will be proud of your work, and have plenty of facts and supporting details. Refrain from presenting them all at once, and do not start with your project journey!

8.2.1 The SCR Framework

There is a simple framework you can use to frame your projects in a powerful way – the SCR framework. SCR stands for Situation, Complication, Resolution (“How to Tell a Story Using the Mckinsey Scr Framework,” n.d.).

Situation

A recognizable and believable summary of the current situation. You need to make sure everyone agrees with the situation at hand before anything can be accomplished!

Over the past 5 years, company X has grown from a startup to a firm with $5 million in revenue and 45 employees.

Complication

Raises a question or change from the status quo – this is your burning platform!

Over the past 2 years, revenue growth has stalled, calling into question the next round of venture funding.

Resolution

Your recommendation: something that will capture your audience’s attention and resolve the complication.

We should develop a sales growth strategy focused first on increasing share of wallet, and second on acquiring new customers.

If you combine this approach with Minto’s Pyramid Principle, you would put the resolution first, followed by the situation and complication. Now that you have a sense for how to structure your storytelling, here are some tips for how to put together a good deck:

  1. Key messages belong in the header – do not use a generic title like ‘market size’ or ‘recommendations’. If you read only the headers of your slides, they should tell a story on their own.
  2. Use consistent fonts and colors throughout. One size for headers, one size for all other text. Use bold and italics for emphasis.
  3. Adjust the font size to suit your medium. Minimum 12 point for printed documents, minimum 14 point for presentations.
  4. Notes and assumptions belong in small font at the bottom of the slide.
  5. Make sure everything is well formatted and aligned. Poor formatting distracts from your message.
  6. Use a visual tracker or ‘breadcrumb’ to let people know what part of the agenda each slide aligns with.

Sometimes, though rarely, you will get questions about your analysis and should be prepared to explain it in an understandable way. In our experience, most audiences will have preconceived notions about what is important. Make sure you have thought through what their concerns and objections will be and can speak to whatever issues they raise. This may be a technical issue about which variables were included or excluded and why, or a more principled one about the overall approach to the analysis.

8.3 Dashboards

A dashboard is a dynamic report containing the key pieces of information a user requires to monitor an organization and inform decisions about it. When developing dashboards, we like to borrow a technique from journalism called ‘progressive disclosure.’ The basic flow is:

  1. Most substantial information
  2. Supporting information
  3. Details

Specifically, you should place key metrics at the very top of your dashboard. These are metrics that an executive needs to run their business, such as number of customers, revenue, and costs. Beneath those metrics, you might include 2-3 supporting graphs. The supporting information normally breaks out key metrics by demographics, time, or other important categories. At the bottom, you would include the details. The details section is flexible and can include anything relevant to support running the business.

Some tips for dashboard creation include:

  1. Know your purpose and audience – an executive dashboard is substantially different from a mid-level manager or operational dashboard.
  2. Limit dashboard complexity. Hick’s Law states that the more choices a user has, the harder it is to make a decision.
  3. Add interactivity and the ability to drill down to gain further insight.
  4. Use visual cues to guide the reader – highlighting and bolding can draw the eye to important areas.

Naturally, different dashboards are suited for different audiences. As with all the analytics principles we have discussed so far, knowing your audience is paramount to a successful project. What is important to a call center manager, a bank executive, or a small business owner are all very different. Some dashboards need to be updated daily, or in real time (e.g., operations focused dashboards), whereas some will only be used weekly or monthly (strategic, company-wide dashboards). Make sure you spend time up-front in your project defining the scope and understanding what your business sponsor is looking for to save time on rebuilding visualizations later.

8.3.1 Tableau

Tableau is a business intelligence tool that allows for graphing and data analysis. It also has robust data importing and linking functionality, allowing for automated data manipulations and dashboard updates. Tableau is one of the more popular business intelligence tools in industry because of how easy it is to access data and produce visualizations without code.

We have presented good data visualization principles, but will not delve deeply into Tableau in this text. For extra Tableau resources, we recommend their starter kit.

Chapter 9 Introduction to Modeling

9.1 Introduction to Regression (Part 1)

Linear regression modelling is one of the first truly powerful analytics models that a data scientist encounters. These models encompass and extend many of the statistical techniques one encounters in an introductory class in statistics. In addition to being powerful in their own right, regression models introduce the structure for related models in that they disentangle how multiple independent variables combine to influence an outcome of interest and simultaneously measure the separate contributions of those independent effects. This approach is used in other models, such as discrete choice modelling (Probit, Logit, etc.), models for duration (hazard models), models for counts (Poisson regression), and models for truncated and censored data (Tobit, the truncated regression model, and others).

The general linear regression model combines a series of parameters (\(\beta_0, \beta_1, ..., \beta_K\)) along with the explanatory variables (\(X_1, X_2, ..., X_K\)) into a linear equation of the form:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_K X_K + Error\]

The error term belongs in every regression model; it captures the influence of all of the other factors omitted from the equation.

A particular form might look like this:

\[Earnings = \beta_0 + \beta_1 Years\_Experience + \beta_2 Years\_Education + Error\; Term\]

The regression method provides an estimate for each of the parameters in the model. It also provides a standard way of testing the parameters, along with diagnostic tools that provide additional information about the fit of variables in the model.

Regression is a really awesome tool!

But, before getting too far into the details, we should acknowledge that there are at least two broadly different views on the role of regression modelling in analytics. Both views are legitimate and lead to useful application of regression modelling, and both are mathematically justified, but they do result in different applications of the tool and lead to different kinds of actions.

Technical Detail

Regression models have both mathematical and statistical properties. The mathematical properties flow from the matrix algebra that defines the model and hold irrespective of the characteristics of the data. The statistical properties follow from the characteristics of the data and the model’s fit with the data generating process (DGP) that creates the data. As we will discuss later, the statistical properties only apply if certain assumptions about the data hold. When they are valid, these statistical properties allow us to develop deeper insights about the problem we are analyzing. The econometric modelling techniques tend to focus a great deal of effort on validating the statistical properties of the models; the black box techniques tend to rely on the mathematical properties only.

At risk of trivializing both methods and emphasizing differences that may be more conceptual than real, we will offer a quick overview of the two.

9.2 Econometric vs Black Box Regression

9.2.1 The Econometric Approach

The econometric view of regression, also known as causal modelling, sees model development as an interaction between theory, expressed through mathematics, and data, expressed through the statistics of regression. The mathematical model involved is thought of as describing a simplified version of the process by which the explanatory variables ‘cause’ the behavior of the dependent variable. For example, a modeler may start with a ‘theory’ that sales are primarily driven by prices and advertising budgets, amongst other factors. Based on this theory, she might build a simple model:

\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Price + Other\; Factors\; (i.e., \;the\; error\; term)\]

to reflect her thinking that as advertising budgets increase, sales will also go up, and that as prices increase, sales will go down. With the mathematical model developed, the data is then used to find the best fit for the parameters (\(\beta_0\), \(\beta_1\), and \(\beta_2\)), along with reams of diagnostic information on the quality and fit of the model. The modeler uses these results to develop insights and to refine the model to reflect her improved understanding of the causal relationship involved. Perhaps her analysis suggests that one of the important other factors is the competitor’s price. She might modify her model to be:

\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Price + \beta_3 Competitor\_Price + Other\; Factors\]

This process can be thought of as a very hands-on process that is not easily automated. It involves an analyst using many tools and techniques to improve her understanding of the world. Often this involves looking at graphical output, running specification tests, creating additional variables, and other techniques.

For reasons that will become clear later, the goal is to find the model that best fits the data, with as much of the variation in the dependent variable as possible explained by the explanatory variables that truly belong in the model. Ideally, very little will be left in the category of other factors / residual / error. A good econometric model should have, amongst other things, a collection of explanatory variables that explain a large amount of the variation, a credible justification for each variable in the model, and a small residual collection of other factors.

9.2.2 The Black Box Approach

The black box method takes a more hands off approach with respect to theory. Aside from setting the content of the original data sets, these techniques (e.g. ridge regression, lasso regression) are driven by algorithms that are highly automated and involve little or no human oversight.

Due to their algorithmic nature, these techniques do not leverage a theory of causation. Instead, they focus on managing the tradeoff between the number of explanatory variables used in the model and the amount of variation explained by the model. Since these are data-intensive methods, they tend to be concerned with the possibility of overfitting the models – which is to say, building a model that fits a particular data set but is not generalizable. Black box methods can be exceptionally useful when there is a lot of data and the goal is to predict an outcome. They are also useful in situations where there is little in the way of theory to explain the data; where the variables that theory would require are not available; or where the process must be automated due to resource constraints or turnaround requirements.

Generally speaking, a good black box model is one where a small collection of explanatory variables reliably predict the dependent variable. In particular, the model should be able to predict results based on samples that were not used to develop the model in the first place.

9.2.3 Which Approach is Better?

If you are looking for a clear winner here, you are going to be disappointed. Each method has its strengths and weaknesses.

Relative to the black box method, the econometric method is obviously more labor intensive, but it also brings non-sample information to the process in the form of a theory. It also allows the model to be adjusted to reflect the business acumen of the modeler. Perhaps more importantly, if the theory is correct – and that is a mighty big if – then the estimates that come from the econometric method have a credible claim of measuring causal effects. If so, then they can be used for prescriptive analytics, which is a huge advantage over black box methods, which can only be used for predictive analytics. While that may sound like a clear win for the econometric method, there are cases where black box techniques easily outperform the econometric technique, particularly in pure prediction tasks. In part this is true because regression is only one of many techniques available to the black box approach; others include decision trees, neural networks, and time series analysis.

Even within the scope of regression models, the black box approach can still beat the econometric approach in developing predictions. The black box method can consider thousands of models in the time it takes the econometric method to develop a single model. If there is little value from external theory, or the data required by such a theory is not available, but a large quantity of potentially relevant data is available, the black box method may find a better model than an econometrician could, and do it much faster.

As if to prove this point, in at least one case econometricians themselves developed black box forecasting techniques due to the difficulties associated with traditional econometric prediction. The field known as time series analysis, and models such as ARIMA, were developed due to the failure of econometric models to provide reliable forecasts using traditional causal modelling.

One conclusion might be that the econometric approach should be used when an insight about the causal process is required to inform rather than automate a decision; when non-sample insights about the data are going to be important in shaping the analysis; and when prescriptive analytics are required. Black box techniques should be used when automated predictions are required and when massive amounts of potentially relevant data are available.

As a data scientist, you should understand, be able to use, and be able to explain both approaches. That said, the econometric approach has a more restrictive set of assumptions because of its use of the statistical properties of regression models, its focus on human-intermediated processing of data, and its aspiration to reflect causation. For the balance of the book, we will focus on the econometric approach, recognizing that much of what we say will also prepare you for black box analysis.

9.3 Introduction to Regression (Part 2)

Most textbooks introduce linear regression with the simple linear regression model. This is a model with a single independent variable. Its general form is:

\[Y = \beta_0 + \beta_1 X_1 + Error\]

A specific model might look like this:

\[Sales = \beta_0 + \beta_1 Ad\_Budget + Error\]

This model is almost never used in practice because most things a data scientist would like to model are explained by more than one factor. The benefit of the simple linear regression model is that it is easy to show graphically. Plotting the fitted regression line produces a chart like this.

A Simple Linear Regression Model

Figure 9.1: A Simple Linear Regression Model

The general linear regression model, as described above, has the form:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_K X_K + Error\]

A particular form might look like this:

\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Price + Error\]

Naturally, this would be a fair bit more difficult to show in a graph, since it is at least three-dimensional. There is a nice trick, however: focus on a single variable while holding the others constant, either at some value like 0, which preserves the interpretation of \(\beta_0\), or at their means, which makes for much more sensible depictions of the results. By doing this, we can depict the relationship between one explanatory variable and the dependent variable.

A Linear Regression Model, All Other Variables Held Constant at 0

Figure 9.2: A Linear Regression Model, All Other Variables Held Constant at 0

Since the multiple regression model can be depicted quite easily using either of these techniques, we will use it throughout the rest of the book, holding other variables constant at zero for simplicity and trusting that our readers can make the adjustment to means for themselves.

9.3.1 What does Linear Regression really mean?

Given that we have a general linear regression model, you might be wondering what it actually means. From your previous studies, you should be familiar with the idea that, for any variable you could consider, the value of that variable is the average of the class to which you belong + your individual deviation from the average.

This is pretty abstract, so let’s use a specific example. Now, it is clear that we don’t know you, but we do know that your earnings last year are described by the following equation:

\[Your\; Earnings = Average\; of\; your\; Class + Your\; Individual\; Variation\]

This is really all that a regression equation is doing. When we have the general linear regression model:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_K X_K + Error\]

The x-variables define the class to which the observation belongs. Sometimes the x-variables are discrete variables like \(X_1\) = 1 if employed; 0 otherwise. More often they are continuous, like \(X_2\) = number of hours worked, and occasionally they are interactions like \(X_3\) = yrs_education * employed. The individual variation is the prediction error, or as we suggested above, the other factors that have been omitted from the model.

In short, a linear regression model simply provides the conditional average of a specific dependent variable, \(Y\), based on a collection of one or more independent variables, \(X_1 ... X_K\), that can be continuous, discrete or some combination of the two.
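To make the ‘conditional average’ idea concrete, here is a sketch on simulated data (the true parameters are invented for illustration) showing that a least-squares fit recovers the relationship used to generate the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data where the true relationship is known:
# Y = 2 + 3*X + individual variation (the error term)
n = 1000
X = rng.uniform(0, 10, size=n)
Y = 2 + 3 * X + rng.normal(scale=1.0, size=n)

# Least-squares fit of a simple linear regression
b1, b0 = np.polyfit(X, Y, deg=1)
print(b0, b1)  # close to the true values 2 and 3
```

The fitted line passes through the conditional averages of Y: for any value of X, the line gives the average of Y among observations in that ‘class’, and each observation deviates from it by its individual error.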

9.3.2 Modelling with Linear Regression

The linear regression model is limited by being linear, but linear in this context is not as restrictive as you might suspect. To see why, let’s consider a few things you can do with a linear regression model.

9.3.2.1 Modelling Different Groups

Linear regression allows you to model distinct groups that have similar responses to some variables but different responses to others. Suppose your client, a restaurant manager, suspects his customers spend more at the restaurant the higher their incomes are. This would suggest a model:

\[Sales = \beta_0 + \beta_1 Income\]

But, he might also suspect that male customers spend more than others. One way to model this would be to assume that male customers spend the same amount more at any given level of income. This would suggest a model like:

\[Sales = \beta_0 + \beta_M Male + \beta_1 Income\]

where Male is a dummy variable marking male respondents: Male = 1 if the customer is male; 0 otherwise.

Linear Regression with a Dummy Variable

Figure 9.3: Linear Regression with a Dummy Variable

You can confirm that the regression model produces these two distinct relationships by considering the set of male customers vs. others. If Male = 0, the customer is not male and the model becomes:

\[Sales = \beta_0 + \beta_1 Income\]

If the customer is male, then Male = 1 and the model becomes:

\[Sales = \beta_0 + \beta_M(1) + \beta_1 Income\]

\[=(\beta_0 + \beta_M) + \beta_1 Income\]

This confirms that we can have two distinct ‘curves’ in one linear regression model, and that the Male line is exactly \(\beta_M\) units above the Not Male line.

Two things are worth noting here. In this model, the ‘base case’ is the not-male case; the impact of being male is measured relative to this base case. While this may seem trivial, particularly if your data set contains only male and female respondents, it becomes important when we deal with more complex situations. For example, we might use a similar approach to predict the likelihood of defaulting on a loan payment based on the city someone lives in. In this case, we might have 15 different cities across Canada, each with its own categorical dummy variable, and we would have to think carefully about what the base case was. We would also have to think carefully about how to interpret the results.

The second thing is that we have drawn this as though the male curve was above the other curve. In actual fact what we have really done is permitted the model to reflect a difference between male customers and others. The data will decide whether that difference is real and whether it is positive or negative. This illustrates a general principle about modelling and ultimately testing: to test a hypothesis, we develop a model that allows the data to express a characteristic and then test to see if it actually does. Here we SUSPECT that males consume more, so we build a model that allows the data to express males consuming more and then see what the results tell us. Perhaps males consume less, or perhaps there is no difference. We will return to this issue when we discuss testing.
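To see this mechanically, here is a small sketch in Python using simulated restaurant data of our own invention (the true coefficients 5, 10 and 0.001 are arbitrary choices for the simulation, not estimates from the book's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
income = rng.uniform(20_000, 120_000, n)
male = rng.integers(0, 2, n)

# Simulated sales: male customers spend a fixed 10 units more at
# every income level -- the beta_M intercept shift described above.
sales = 5.0 + 10.0 * male + 0.001 * income + rng.normal(0.0, 2.0, n)

X = np.column_stack([male, income])
reg = LinearRegression().fit(X, sales)
b_M, b_1 = reg.coef_
```

The estimated coefficient on Male recovers the simulated shift of roughly 10 units, the vertical distance between the two lines in Figure 9.3.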

Returning to our example, perhaps upon seeing the output from this model, the client revises his thinking – he now believes that what really happens is that male customers spend a greater share of their income on food in the restaurant, rather than simply spending the same amount more at any given level of income. This could be captured by a different model:

\[Sales = \beta_0 + \beta_1 Income + \beta_M Male * Income\]

This model would produce the following relationship:


Figure 9.4: Linear Regression with an Interaction Dummy Variable

By considering the set of male customers vs. others, you can confirm that this single model produces two sloped lines with the same intercept. As before, this model might actually produce a result where sales to the male segment of the population are less responsive to income than sales to the base case. If that were the case, \(\beta_M\) would be a negative number.
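The interaction version can be sketched the same way, again on simulated data of our own (the slopes 0.0010 and 0.0005 are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
income = rng.uniform(20_000, 120_000, n)
male = rng.integers(0, 2, n)

# Shared intercept for everyone; male customers have a steeper slope,
# i.e. they spend a larger share of each extra dollar of income.
sales = 5.0 + 0.0010 * income + 0.0005 * male * income + rng.normal(0.0, 2.0, n)

# The interaction term Male * Income is just another column of data.
X = np.column_stack([income, male * income])
reg = LinearRegression().fit(X, sales)
b_1, b_M = reg.coef_
```

A positive estimate of \(\beta_M\) indicates the steeper male slope; a negative estimate would indicate the opposite.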

9.3.2.2 Modelling Different Time Periods

In addition to modelling different groups, models can be developed to capture different effects in different time periods due to a structural change in the real-world process. Models like this are used when the nature of the relationship changes at a particular point in time, rather than evolving slowly over time. They can be used to reflect how relationships might change due to tax policy, competitor behavior, technological change, or any other significant event.

For example, your client, an automotive parts supplier, may have been experiencing an increase in demand for electric car components over time and would like a simple model to forecast demand in the future – perhaps to know when to expand capacity. In this case, a model like

\[Sales = \beta_0 + \beta_1 Time\]

might be a good starting point. It certainly has the virtue of simplicity and would be easy to develop and use. The relationship captured by the model would look like:

Figure 9.5: Linear Regression over Time

Now suppose that at some point in time \(T*\), the government introduces a subsidy to increase the sales of electric cars. At that point, your client expects that the sales of electric cars will jump up, but then continue growing at the original rate. In other words, your client expects that sales look like this:


Figure 9.6: Linear Regression with a Jump Change at Time T*

Based on the discussion above, how could you build such a model? Try writing out the equations that would do it, then test to see if it actually works. Spend a minute or two on this, then check your results below.
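If you would like to check your work, here is one possible formulation, fitted to simulated data (the break point \(T*\), the jump of 25, and all other values are our own inventions): the jump is captured by a dummy variable that switches on at \(T*\).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
t = np.arange(100)
T_star = 60
jump = (t >= T_star).astype(float)   # dummy: 1 once the subsidy is in place

# Simulated sales: steady growth of 0.5 per period, plus a one-time
# jump of 25 at T*, after which growth continues at the original rate.
sales = 10.0 + 0.5 * t + 25.0 * jump + rng.normal(0.0, 1.0, 100)

X = np.column_stack([t, jump])
reg = LinearRegression().fit(X, sales)
b_time, b_jump = reg.coef_
```

The coefficient on the dummy recovers the size of the jump, while the time coefficient recovers the (unchanged) growth rate.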

Consider one more structural change model. Suppose your client believed that the structural change would not cause a jump, but instead would cause the rate of growth to increase at time period \(T*\). In other words, the relationship she expects looks like:


Figure 9.7: Linear Regression with a Slope Change at Time T*

Again, try to develop this model, but be warned, it is a fair bit more tricky than the previous one.

Hint 1: It will involve a dummy variable

Hint 2: It will involve a new variable for time.

Hint 3: Solution
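For reference, here is one way to implement the slope-change model, sketched on simulated data (the break point \(T*\) and all coefficients are invented values of our own). Per the hints, it combines a dummy variable with a new time variable measuring time elapsed since \(T*\):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
t = np.arange(100)
T_star = 60
after = (t >= T_star).astype(float)      # Hint 1: the dummy variable
t_since = after * (t - T_star)           # Hint 2: time elapsed since T*

# Growth speeds up from 0.5 to 0.8 per period at T*, with no jump,
# because t_since is zero exactly at the break point.
sales = 10.0 + 0.5 * t + 0.3 * t_since + rng.normal(0.0, 1.0, 100)

X = np.column_stack([t, t_since])
reg = LinearRegression().fit(X, sales)
b_time, b_extra = reg.coef_
```

The coefficient on the new variable is the *change* in slope after \(T*\), not the new slope itself; the post-break slope is the sum of the two coefficients.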

9.3.2.3 Modelling with Curves

The ‘linear’ part of linear regression suggests that the model can only capture straight lines. This is not the case – by transforming variables, a wide class of nonlinear relationships can be captured by linear regression. There are limits to what can be done, but linear regression can model many ‘non-linear’ relationships including: polynomials, functions of \(\frac{1}{x}\), log-linear relationships, and others.

For example, your client may generally believe that sales increase with advertising budget, but that the rate of increase decreases as the budget grows. So, the relationship she envisions is one like that shown below.


Figure 9.8: Linear Regression with Decreasing Returns to Scale

Such a relationship might be captured by either of these two models. The first is a simple rational function, also known as a reciprocal function:

\[Sales = \beta_0 + \frac{\beta_1}{Ad\_Budget +1}\]

where we would expect \(\beta_1\) to be a negative number.

The second is a polynomial function, in this case a second-order polynomial, though higher-order ones could be used in principle:

\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Ad\_Budget^2\]

Where \(\beta_1\) would be positive and \(\beta_2\) would be negative.

Their respective graphs would look like this:


Figure 9.9: Linear Regression with a Reciprocal and Polynomial Curve

The first model’s shape is consistent in that it approaches some upwards limit set by \(\beta_0\), but the slope is quite restrictive and is set by a single term, \(\beta_1\). The second model has more flexibility because it has three estimated parameters, but clearly only applies over a narrow range of observations since sales actually go down as Ad_Budget increases beyond about 20.
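To emphasize that these are still linear regressions, here is the polynomial version fitted to simulated advertising data of our own (the true values 3, 4 and −0.1 are invented for the simulation): the model remains linear in the betas, and only the x-variable is transformed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
budget = rng.uniform(0.0, 20.0, 300)

# Diminishing returns: positive linear term, negative squared term.
sales = 3.0 + 4.0 * budget - 0.1 * budget**2 + rng.normal(0.0, 1.0, 300)

# Adding budget^2 as a second column is the whole trick: the fit is
# still ordinary linear regression on (budget, budget^2).
X = np.column_stack([budget, budget**2])
reg = LinearRegression().fit(X, sales)
b_1, b_2 = reg.coef_
```

As the text predicts, \(\beta_1\) comes out positive and \(\beta_2\) negative, producing the concave curve in Figure 9.9.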

In addition to showing how curves can be captured by linear regression, these models illustrate the important point that models are designed to apply to a narrow range of data. Beyond that range, they do not have a reasonable hope of predicting results.

These two models were applied to ranges of advertising budgets from 0 to 30 – it is probably not reasonable to attempt to fit a single model to such a wide range of data. If the models were fit to a narrower range of data, it would tend to fit better because the extreme characteristics – the steep slope at the beginning of the first function and the downward curve at the end of the second, would not appear in the range of observations.

Our purpose here is not to claim that these are the best models to use in any particular setting, rather, it is to demonstrate some of the range of modelling options available in linear regression.

9.3.3 Regression Assumptions

Now that you are familiar with what regression can do, let’s explore the requirements for the regression model. As we mentioned earlier, linear regression has both mathematical and statistical properties. Most of the assumptions are necessary for the statistical properties to hold, but two are required for the mathematical properties to hold.

9.3.3.1 Mathematical Requirements

1. There must not be a perfect linear relationship among any of the x-variables.

This means that none of the \(x_0 ... x_K\) variables can be expressed as \(x_i = \sum_{j \neq i} a_j x_j\) for any set of constants \(a_j\).

You may have noticed we said all \(x\) variables from \(x_0\) to \(x_K\) – even though we typically refer to having \(k\) \(x\) variables. That was not a mistake: the linear regression model assumes that a column of ones is provided as one of the \(x\) variables, and this invisible \(x\) variable is associated with the \(\beta_0\) term.

We mention this because this requirement is typically only violated by dummy variables when one includes a dummy variable for every category in the model. Suppose your data set contains a single gender field that can be coded as male or female. You might be tempted to specify a model similar to the one above that modeled male sales in restaurants, only with both male and female characteristics:

\[Sales = \beta_0 + \beta_M Male+ \beta_F Female + \beta_1 Income\]

with suitably defined dummy variables Male and Female. The problem here is that, since every observation is coded as male or female, you get:

\[1 = Male + Female\]

which violates the requirement and will result in an error when the regression is run.

To avoid this, a category must always be omitted when using dummy variables to indicate category membership. That omitted category becomes the base against which all other effects are measured. In the case above, we used Female as the base category and treated Males as deviations from that category.
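In practice, dummy-variable tools handle this for you. As a sketch (the tiny DataFrame below is invented for illustration), pandas' get_dummies has a drop_first option that omits one category as the base case:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "income": [50_000, 62_000, 48_000, 75_000],
})

# drop_first=True omits one category (the base case), so the dummies
# can never sum to the constant column of ones: the trap 1 = Male + Female
# described above cannot occur.
dummies = pd.get_dummies(df["gender"], drop_first=True)
print(list(dummies.columns))
```

With the first (alphabetical) category dropped, only one dummy column remains, and its coefficient measures the deviation from the omitted base category.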

The only other time we have seen this emerge in practice is when variables are derived from each other. For example a model might contain, Age, Experience and Education, but somewhere along the line, experience might be calculated by a proxy where \(Experience = Age – Education - 5\), where the 5 represents years before formal education starts.

2. While not strictly a mathematical requirement, some of the nice mathematical features one takes for granted in a regression model will not work if the \(\beta_0\) term is omitted. We typically recommend retaining the \(\beta_0\) term, even if it is not statistically significant, unless you really know what you are doing and why.

Beyond the mathematical assumptions, there are statistical assumptions. These assumptions are a bit trickier than the mathematical ones because the model will typically produce results even when the assumptions are violated – the results may be unreliable, but you will get them. A good analyst will want to know what the assumptions are and how to confirm that they apply. Clearly, this list is only meant as a minimal technical starting point.

9.3.3.2 Statistical Requirements

1. The model must be specified correctly. In other words, the relationship between the explanatory variables and the dependent variable must match that of the real data generating process. Naturally, this means that the relationship must be linear in the general sense described above.

This assumption is probably the least appreciated and most important assumption when it comes to prescriptive analytics. It implies that, for prescriptive analytics at least, the model has to be consistent with the true but unknown data generating process.

There are several ways that models can be misspecified. The most common problem is to have omitted variables. This is often a result of simply not having the data. How significant this problem is depends on the relationship between the variables that are missing and the ones that have been included. If the omitted data is correlated with the included data, the effects of the omitted data will be wrongly attributed to the included data. For example, suppose that the productivity of an assembly line worker increases with work experience, but that work experience cannot be easily measured, though years of education are available. One might be tempted to build a model omitting the variable experience and produce a model:

\[Productivity = \beta_0 + \beta_1 Yrs\_Education + Error\]

The model is likely to work out just fine because the omitted variable, experience, is not likely to be strongly correlated with education. There may be some cohort effects – in that levels of education have been going up over time, so there may be a slight negative correlation between the two variables, but for this example let’s assume it is negligible. In that case, the estimated impact of education, such as it is in this model, is likely to be unbiased and the effect of experience would end up in the error term.

Suppose the data scientist doing the analysis realizes that, while experience may be hard to measure, age is not, and age is correlated with experience, particularly if education is also considered. So perhaps the analyst will choose to include Age, even though it doesn’t theoretically belong, simply because it improves the fit of the model. The resulting model would be:

\[Productivity = \beta_0 + \beta_1 Yrs\_Education + \beta_2 Age + Error\]

In this case, the omitted variable is strongly correlated with one that is included in the model. This means that the estimated impact of age will be biased. It is likely that age will appear to be associated with higher productivity, even though it is not. And, ironically, including it will improve the fit of the model – and even its in-sample prediction. If this mistake were not detected, and the model were used for recruiting purposes, the managerial implication would be to hire older workers, without regard for their actual experience. This would very likely not produce the productivity gain the client expected.

Even if the model has the right variables, there can still be problems. The model could have the wrong relationships amongst the variables – such as leaving out differences between subsets of the data, leaving out a structural change, or neglecting nonlinear effects. Conveniently, these are the exact extensions to the linear regression model that we demonstrated above!

As you saw when we addressed these problems, we typically corrected them by adding new variables to the model. In some sense, failing to correct these problems is often a special case of omitting variables. You will encounter other cases as you study more regression and related models.

In summary, for model specification errors, the good news is that the structure of the model as

\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_K X_K + Other\; Factors\]

allows the omission of a large number of other important factors provided the impact of those factors is independent of the included ones. However, if excluded variables are correlated with the ones that are included in the model, the estimates will be distorted and using the model for prescriptive analytics will be risky, though the model might still be useful for predictions.

2. The error terms each have mean zero, uniform variance and are independent of all other variables. This assumption compounds two assumptions that typically arise in textbooks. We have grouped them together for convenience. Violations of this assumption can give rise to a variety of issues. Two common ones are heteroscedasticity and, for time series, autocorrelation, though there are others that you may encounter if you study this area further. Fortunately, heteroscedasticity and autocorrelation can be detected and managed with modern statistical software.

3. The X terms can be treated as fixed in repeated samples. This assumption is a bit of overkill. It is designed to ensure that the randomness of the X-variables cannot lead to problems in estimating the parameters due to a relationship in the randomness among the explanatory and dependent variables. Killing off randomness by assuming the x-variables are fixed will certainly remove the prospect of any relationship, but can be replaced by weaker assumptions about independence.

The key problem here is that you don’t actually get to choose the character of your X-variables, so you have to use what you get. Techniques exist for identifying and dealing with some of the problems that arise from this violation.

9.3.3.3 So how does it work?

Linear regression uses the ordinary least squares (OLS) method to find the parameters that best fit the data, in the sense that it minimizes the sum of the squares of the individual error terms. To calculate a collection of errors, we need to replace each of the beta terms, \(\beta_0 ... \beta_K\), with its estimated value; these estimates are typically noted by putting a hat character on the individual beta terms. So the regression equation becomes:

\[Y = \hat{\beta_0} + \hat{\beta_1}X_1 + ... + \hat{\beta_K}X_K + \hat{E}\]

Rearranging this equation and focusing on a single observation, \(i\), we get the \(i\)’th estimated error term (also known as a residual):

\[y_i - (\hat{\beta_0}+\hat{\beta_1}X_{i1} + ... + \hat{\beta_K}X_{iK}) = y_i - \hat{y_i} = \hat{e_i}\]

Graphically, this value can be depicted as:

Figure 9.10: Linear Regression with Error Calculations

Since all the estimated values lie along the regression line, you can see that they have less variability than the original values do. This turns out to be the basis for one of the important measures of fit for a regression model, \(R^2\), which measures the share of the original \(y\)’s variability (or, more technically, its variance) that is explained by the model. In some sense, a perfect model would explain all of the variability in \(y\).
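The residual and \(R^2\) calculations can be verified directly. Here is a sketch on simulated data of our own (the true intercept 2 and slope 3 are invented), computing \(R^2\) by hand as one minus the ratio of residual variance to total variance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 10.0, 200).reshape(-1, 1)
y = 2.0 + 3.0 * x.ravel() + rng.normal(0.0, 1.0, 200)

reg = LinearRegression().fit(x, y)
residuals = y - reg.predict(x)           # the estimated e_i terms

# R^2: the share of y's variance explained by the model.  With an
# intercept included, the residuals have mean zero, so this ratio
# matches the library's own reg.score(x, y).
r2_manual = 1.0 - residuals.var() / y.var()
```

This makes the definition concrete: the less variable the residuals are relative to \(y\), the closer \(R^2\) is to one.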

Because of the importance of explaining the variation in outcomes, analysts often over-focus on the share of variance explained when selecting the best regression model to use. Since adding x-variables to any model necessarily increases \(R^2\) by some amount, an over-focus on \(R^2\) tends to create larger and larger models in which variables with no theoretical justification and minimal statistical impact are included. This approach, occasionally described as ‘chasing \(R^2\)’, invariably leads to the creation of unreliable models. This issue will be discussed further in later texts.

Effectively, the OLS method finds the estimated beta terms that minimize the sum of the squares of these individual estimated errors. While this may seem like an arbitrary choice, it is actually supported by solid statistical theory.

We need not go into the justification for this choice beyond saying that there are very good reasons behind it when the assumptions of the regression model hold. The actual production of regression estimates is done by statistical software. Aside from undergraduate students in classes taught by mildly psychotic econometrics professors, no one ever calculates a regression estimate by hand. For example, in Python we can build the simple linear model:

\[Grocery\; Bill = \beta_0 + \beta_1 Family\_Income \]

with the following code and the grocery store data. There are a number of different ways to run a regression in Python. We will use scikit-learn here.

from sklearn.linear_model import LinearRegression
import numpy as np

train_X = grocery_data['Family_Income'].values
train_y = grocery_data['Grocery_Bill'].values

train_X= train_X.reshape(-1, 1)
train_y= train_y.reshape(-1, 1)

reg = LinearRegression().fit(train_X, train_y)

print("B_0 =",reg.intercept_)
## B_0 = [-22.65328321]
print("B_1 =",reg.coef_[0])
## B_1 = [0.00257742]

It is worth noting that there are a number of ways to perform linear regression in Python. The scikit-learn approach above is best suited to a machine learning / black box approach. You will see another method later in this text, using the statsmodels package, that allows for better diagnostic reporting.

As discussed earlier, the estimated beta terms can also be calculated directly using matrix algebra, via \(\hat{\beta} = (X^TX)^{-1}X^Ty\). The fitted values can then be plotted as a line of best fit over a scatterplot of the data.

from numpy.linalg import inv
from matplotlib import pyplot

# Note: train_X contains only Family_Income, with no column of ones,
# so this computes a no-intercept fit; that is why the slope below
# differs slightly from the B_1 estimate above.
b = inv(train_X.T.dot(train_X)).dot(train_X.T).dot(train_y)

print(b)
## [[0.00237178]]
yhat = train_X.dot(b)
pyplot.scatter(train_X, train_y)
pyplot.plot(train_X, yhat, color='red')
pyplot.show()

9.3.4 The Role of Data in Regression

9.3.4.1 Quantity and Scope (n and k) in Econometric Modelling

The quality of an analytics model is ultimately limited by the quantity and quality of data available to the modeler. For regression modelling, we can think of the data as having two important dimensions, \(n\) and \(k\). \(n\) refers to the number of observations on each of the data elements, which is to say the size of the sample. \(k\) refers to the number of distinct data elements being measured for each member of the sample, which is to say the scope of the data.

Generally speaking, the larger the sample, \(n\), the better. For practical purposes, no one would trust a regression model with fewer than about 30 observations per independent variable used.

As sample size increases beyond that minimum, two things occur: your estimated results become more accurate and the model becomes less sensitive to certain assumptions.

The accuracy of the model tends to grow because, as we saw with estimates for sample averages, the standard error of the estimated sample mean tends to fall in proportion to \(\frac{1}{\sqrt{n}}\). Something similar happens with regression estimates of the Beta terms – which are in some sense just conditional averages. This means that as the sample increases in size, the estimates become more accurate, but at a decreasing rate: the first 500 observations are much more valuable to you than the second 500 observations.
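The \(\frac{1}{\sqrt{n}}\) behaviour is easy to see by simulation. A minimal sketch (the distribution and sample sizes are arbitrary choices of ours): quadrupling the sample size roughly halves the standard error of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(6)

def se_of_mean(n, reps=2000):
    """Empirical standard error of the sample mean at sample size n."""
    samples = rng.normal(0.0, 1.0, size=(reps, n))
    return samples.mean(axis=1).std()

se_500 = se_of_mean(500)
se_2000 = se_of_mean(2000)
# Quadrupling n should roughly halve the standard error (1/sqrt(n)).
```

This is exactly the diminishing-returns pattern described above: the first 500 observations buy far more accuracy than the next 500.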

In terms of our conceptual model for regression, where we say:

\[Your\; Earnings = Average\; of\; your\; Class + Your\; Individual\; Variation\]

Increasing the sample size improves the estimate of the ‘average of your class’ part. It does nothing to address the ‘individual variation’ part. So increasing sample size improves the prediction for an average member of a group, but not an individual member.

Beyond the improvement in accuracy, increasing sample size can reduce the sensitivity of the model to its assumptions. To explain why, we need to recognize that there are two theoretic categories for sample size: standard regression modelling and asymptotic theory.

Standard regression theory requires stricter assumptions about the nature of the error term. It generally speaks to predictions that are unbiased if the OLS assumptions hold as described above. However, one additional assumption is typically very beneficial here: that the error terms have a normal distribution. This assumption was not listed in the set of assumptions above because many of the results of interest hold without it and it is not generally required for asymptotic theory.

Asymptotic theory is used when the samples are sufficiently large as to be treated as though they were ‘approaching infinity’. While that may seem like it should be a very large number, infinity, in this sense at least, is not what it used to be – generally a couple thousand observations should do it. When dealing with asymptotic theory, we think of the parameter estimates as converging on the true values if the assumptions hold. Under asymptotic theory, the residuals can be treated as asymptotically normal, which allows for effective hypothesis testing even if the residuals are not normal. The bottom line here is that your first consideration of sample size is to get a couple thousand observations to ensure that asymptotic theory applies. Then you want to increase the sample size, if required, to ensure that your estimates are sufficiently accurate to serve your purposes.

In terms of sample scope, \(k\), the more variables with a credible claim to belonging in the model, the better. In terms of our conceptual model for regression:

\[Your\; Earnings = Average\; of\; your\; Class + Your\; Individual\; Variation\]

Having more variables that belong in the model allows you to better characterize the class to which an individual belongs and therefore reduces the individual’s variation relative to that class. In short, more variables that belong in the model means less residual variation, means better estimates. The trick here is to avoid pulling variables into the model when they do not belong.

This section was largely informed by Kennedy’s text Kennedy (2008).

Chapter 10 Missing Data and Imputation

For a variety of reasons, the data one would like to use in a regression model may be missing one or more fields from an observation. This can occur because a respondent refuses to or cannot answer a question, because data cannot effectively be matched between data files, or because data that is in the file is known to be in error, which effectively means the true value is missing.

For example, a respondent may choose not to completely answer a survey that has 50 questions on it. Perhaps he declines to answer two of the questions – perhaps because he does not know the answer, perhaps because he knows but would prefer not to say. This could leave the data set missing two out of 50 questions. For a data scientist, it should be obvious that throwing out the 48 good data elements because of two missing ones seems wasteful. More importantly, as we will explain below, throwing out the observation could result in biasing the results of the analysis.

Since missing data is a common problem, many techniques have been developed for dealing with it. We will discuss a few of them and their consequences.

10.1 Patterns of Missing Data

For reasons you will see, the strategy for dealing with missing data, and its consequences, depend on both the pattern of the missing data and the quantity of data that is missing. To see why, we will have to explain the three patterns missing data can have: missing completely at random, missing at random, and missing not at random.

10.1.1 MCAR

A data element is missing completely at random (MCAR), if there is absolutely no pattern as to why an element is missing that relates to either the content of data or unobservable patterns that might relate to the analysis in question. This means that the probability that a particular element is missing is not related to any insights that could come from the data.

In most situations, assuming data is MCAR is a very strong assumption to make. It may occur for technical reasons such as linking databases, but it is difficult to assert when a respondent may choose not to answer a question. An example of MCAR might be if postal codes for home and work were used to determine a commute distance. It may be that some postal code data is missing because the respondent doesn’t know the work postal code. It might be reasonable to assume that the probability of someone not knowing their work postal code is unrelated to the content of that field, and therefore to the data or any other pattern. The missingness might be related to how long they have been working at that location, but if that is not part of the data set and there are no patterns between length of employment and other issues of interest, the data could be MCAR.

If you are fortunate enough to have MCAR data, that is the least troublesome form of missing data – it does not bias the data at all and can be ‘corrected’ simply by omitting incomplete records from the sample. The results are equivalent to having a smaller \(n\), exactly as though a smaller sample had been constructed in the first place.

Alternatively, the processes described below might reasonably be used with confidence that they will not bias the results. This could be desirable if the sample size was already small or many observations were impacted by MCAR data.

10.1.2 MAR

Data is missing at random (MAR) if there is a pattern to the probability that an observation is missing that can be completely explained by the data that is available. In particular this means that there cannot be any reference to the value that is actually missing or patterns that might relate to it.

Since part of the MAR definition includes a reference to an omitted piece of data, it is impossible to demonstrate statistically that a data element is MAR. It is necessary to demonstrate this using credible logic. It may be that individuals are less likely to remember, and therefore to respond to, a question regarding the category of movie they last went to (e.g. action, romance, science fiction, horror, etc.) based on how long ago they went. The non-response would have a pattern, but that pattern might reasonably be assumed not to be based on the content of the missing data, though the probability of non-response might be predicted by other variables.

If data is MAR, it cannot be ignored without risking the validity of the analysis; however, some of the techniques below can yield asymptotically correct analysis.

10.1.3 MNAR

Data is missing not at random (MNAR) if there is a pattern to the likelihood that a data element is missing that depends on the missing element.

For example, if you were to ask a person how many crimes they committed but got away with in the previous year, you might expect that the probability of non-response depended on the content of the answer.

When data is MNAR you have potentially serious problems with the analysis. Worse, they are hard to deal with using existing data because the relevant data is, by definition, missing.

10.2 Techniques

There are at least three categories of techniques that can be used.

10.2.1 Drop Observations

The easiest technique is to simply drop the entire observation from the analysis. This only works with MCAR data. If the sample is large enough that the loss of the MCAR observations has a small impact, this is certainly the easiest way to go.
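In pandas, this is a one-liner. A minimal sketch (the tiny DataFrame below is invented for illustration, echoing the grocery data's column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Grocery_Bill":  [210.0, 145.0, 98.0, 310.0],
    "Family_Income": [85_000.0, np.nan, 41_000.0, np.nan],
})

# Listwise deletion: keep only the complete cases, dropping any row
# with at least one missing field.
complete_cases = df.dropna()
```

Here two of the four rows survive; under MCAR this is equivalent to having started with a smaller sample.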

10.2.2 Mean Substitution

If you cannot drop the observation, you could substitute the missing values with the mean of the field. This gets the element back into the dataset, so it gets the richness from the non-missing elements back into the sample; however, it does a poor job of reflecting the data of the missing element(s).

Our general feeling is that this approach may be acceptable with MCAR observations where the missing elements are small in number and of little importance. Generally, there are better techniques, but this one is certainly fast and easy to explain to clients.
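Mean substitution is equally simple in pandas. A sketch on invented values:

```python
import numpy as np
import pandas as pd

income = pd.Series([85_000.0, np.nan, 41_000.0, 60_000.0],
                   name="Family_Income")

# Replace each missing value with the mean of the observed values.
# Note: this shrinks the variability of the field, which is the main
# weakness of the approach described above.
filled = income.fillna(income.mean())
```

The missing entry is replaced by the mean of the three observed incomes, 62,000 in this toy example.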

10.2.3 Modelling Techniques

These techniques depend on developing a model for the missing observation based on the remaining data. There are different ways to do this, and they are generally built into software, but they work something like this: suppose that in a dataset, \(X_2\) is missing for the 5th of 200 respondents. In principle, you could use all the remaining x-variables to build a model for \(X_2\). The model might look like:

\[X_2 = \beta_0 + \beta_1 X_1 + \beta_3 X_3 + ... + \beta_K X_K + Error\]

Using data on all observations except those of respondent five, whose value was missing, you could estimate the parameters \(\beta_0, \beta_1, \beta_3, ... , \beta_K\). This gives you a model to estimate \(X_2\) for the fifth respondent using the other x-variables. In principle, this estimate can be used in place of the missing value.

This process is called imputation. When using imputation, an estimated value is substituted for the omitted actual value. While that all sounds good, in actual fact the resulting estimated values will have less variability than the real data. To correct this, most software does ‘multiple imputation’, where several data sets are created with \(X_2 = Estimated\_X_2 + Random\_Error\). The model is then run multiple times and the average is used. This reduces the risk of overfitting the data. Again, this is built into software, so you will not have to do it manually.
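As one illustration of model-based imputation in Python, here is a sketch using scikit-learn's IterativeImputer on simulated data of our own (this is our choice of tool for the illustration, not the package used in the text; the relationship \(X_2 \approx 2 X_1\) is invented):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
x1 = rng.normal(50.0, 10.0, 200)
x2 = 2.0 * x1 + rng.normal(0.0, 1.0, 200)   # X_2 is predictable from X_1
X = np.column_stack([x1, x2])
X[4, 1] = np.nan                            # respondent 5 is missing X_2

# Each missing value is modelled from the remaining columns, much as
# the regression-on-X_2 example above describes.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```

The imputed entry lands close to the value the true relationship would predict, because the imputer fits exactly the kind of auxiliary regression sketched above.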

Variants of this approach can correct for MCAR or MAR values, at least in large samples.

10.2.3.1 What to do about MNAR?

MNAR remains a significant problem. We typically recommend interpreting the results with extreme caution and / or using what we like to call ‘prejudicial values’ when doing MNAR analysis. While this is more of a philosophical than statistical approach, a prejudicial value is one that would make it more difficult to find the result that would motivate you to take risky action. Naturally, the character of this value depends on the analysis you were proposing to undertake. You might review our previous text’s discussion on modelling under uncertainty for a fuller discussion of the issues involved and why prejudicial values might help with MNAR data.

10.2.4 Multiple Imputation in Practice

Once you have decided multiple imputation is the best approach to handling your missing data, you can begin working in Python. The statsmodels package provides a MICE (Multiple Imputation by Chained Equations) routine. We will be using the same grocery data as before, but with artificially missing data this time. Let us assume that the data is MAR.

grocery_data_missing.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1000 entries, 0 to 999
## Data columns (total 9 columns):
## Grocery_Bill         1000 non-null float64
## N_Adults             1000 non-null int64
## Family_Income        977 non-null float64
## Family_Size          1000 non-null int64
## N_Vehicles           1000 non-null int64
## Distance_to_Store    1000 non-null int64
## Vegetarian           1000 non-null int64
## N_Children           1000 non-null int64
## Family_Pet           1000 non-null int64
## dtypes: float64(2), int64(7)
## memory usage: 70.4 KB

Note how Family_Income has only 977 non-null entries.

frmla = 'Grocery_Bill ~ N_Adults + Family_Income + N_Vehicles + ' + \
    'Distance_to_Store + Vegetarian + N_Children + Family_Pet'
    
from statsmodels.formula.api import ols
results = ols(frmla,grocery_data_missing).fit()
results.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:           Grocery_Bill   R-squared:                       0.839
## Model:                            OLS   Adj. R-squared:                  0.838
## Method:                 Least Squares   F-statistic:                     723.7
## Date:                Tue, 26 May 2020   Prob (F-statistic):               0.00
## Time:                        09:20:31   Log-Likelihood:                -4788.7
## No. Observations:                 977   AIC:                             9593.
## Df Residuals:                     969   BIC:                             9633.
## Df Model:                           7                                         
## Covariance Type:            nonrobust                                         
## =====================================================================================
##                         coef    std err          t      P>|t|      [0.025      0.975]
## -------------------------------------------------------------------------------------
## Intercept            23.0180      5.798      3.970      0.000      11.640      34.396
## N_Adults             54.5158      3.878     14.057      0.000      46.905      62.126
## Family_Income         0.0010   8.89e-05     10.718      0.000       0.001       0.001
## N_Vehicles           -1.0368      1.167     -0.888      0.375      -3.327       1.254
## Distance_to_Store     3.8825      0.193     20.151      0.000       3.504       4.261
## Vegetarian           -7.9646      4.264     -1.868      0.062     -16.331       0.402
## N_Children           28.1890      1.231     22.899      0.000      25.773      30.605
## Family_Pet            1.3033      2.875      0.453      0.650      -4.339       6.946
## ==============================================================================
## Omnibus:                       51.282   Durbin-Watson:                   2.089
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):              172.930
## Skew:                           0.084   Prob(JB):                     2.81e-38
## Kurtosis:                       5.054   Cond. No.                     6.23e+05
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 6.23e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """

Note how the regression automatically deletes the rows with missing values. In the scikit-learn version, the regression won’t run at all – it raises an error when it encounters missing values! The code below will use MICE to solve this.

import statsmodels.imputation.mice as mice
import statsmodels.regression.linear_model as sm

#Impute the data set with MICE Data
imp = mice.MICEData(grocery_data_missing)
imputed_data = imp.next_sample()
#Merge the imputed data with the blank data
merge_data = pd.merge(imputed_data,grocery_data_missing,left_index=True,right_index=True)
merge_data = merge_data[['Family_Income_x','Family_Income_y']]
#Family_Income_x are the imputed values
merge_data[(pd.isnull(merge_data.Family_Income_y))]

##      Family_Income_x  Family_Income_y
## 11           71568.0              NaN
## 27           79562.0              NaN
## 55           76931.0              NaN
## 72          132398.0              NaN
## 93          124035.0              NaN
## 123         135966.0              NaN
## 143          92001.0              NaN
## 161          86255.0              NaN
## 182          85280.0              NaN
## 201          85206.0              NaN
## 225         133961.0              NaN
## 246          78444.0              NaN
## 270          79197.0              NaN
## 303          70037.0              NaN
## 342          91828.0              NaN
## 368          73509.0              NaN
## 376         123929.0              NaN
## 441         112202.0              NaN
## 462          90780.0              NaN
## 486         123929.0              NaN
## 522         132080.0              NaN
## 543         111181.0              NaN
## 597          78846.0              NaN
#Pool the regression results across the imputed datasets
mice_model = mice.MICE(frmla, sm.OLS, imp)
results = mice_model.fit(n_imputations=5)
print(results.summary())
##                               Results: MICE
## =========================================================================
## Method:                   MICE              Sample size:          1000   
## Model:                    OLS               Scale                 1066.98
## Dependent variable:       Grocery_Bill      Num. imputations      5      
## -------------------------------------------------------------------------
##                    Coef.  Std.Err.    t    P>|t|   [0.025   0.975]  FMI  
## -------------------------------------------------------------------------
## Intercept         23.4533   5.7469  4.0810 0.0000  12.1895 34.7171 0.0030
## N_Adults          54.8269   3.8230 14.3414 0.0000  47.3340 62.3198 0.0011
## Family_Income      0.0009   0.0001 10.6315 0.0000   0.0008  0.0011 0.0039
## N_Vehicles        -0.7649   1.1515 -0.6643 0.5065  -3.0218  1.4920 0.0023
## Distance_to_Store  3.8878   0.1912 20.3351 0.0000   3.5130  4.2625 0.0022
## Vegetarian        -7.8053   4.2588 -1.8327 0.0668 -16.1523  0.5418 0.0002
## N_Children        28.3693   1.2231 23.1941 0.0000  25.9720 30.7666 0.0084
## Family_Pet         1.5422   2.8335  0.5443 0.5862  -4.0113  7.0957 0.0022
## =========================================================================

In the code above, we tell Python to take the missing grocery data and impute it 5 different times using linear regression. The result of this command is 5 copies of the dataset, each with different values for the imputed missing observations. The differences are due to the randomized error term in each model. Once we have our imputed data, we use it to build a new predictive model. In this example, we have 5 datasets, and therefore run 5 separate regressions and pool the results. The summary shows the pooled results of all 5 regressions. As a rule of thumb, you should use a minimum of 5 imputations, or as many imputations as the percentage of missing observations you have (i.e., use 10 imputations if 10% of your observations have missing data).

From here, you can interpret the regression results as you normally would.

Chapter 11 Model Specification

At the very minimum, econometric modelling involves two processes: developing the model specification and estimating the model’s parameters. In the traditional theory, developing the model’s structure should be done without reference to the data – it should be developed by theory. The data should only be used to estimate parameters, not to choose which variables should be included in the model.

You will note that this view is the exact opposite of the black box view, where data is used to both develop and estimate the models. Since we know that the black box methods work, there seems to be a bit of a conflict here. Part of the reason for the apparent conflict is that the approaches purport to accomplish different things. The black box methods generally aspire to provide the best possible estimates; the econometric methods aim to provide estimates of causal factors. These are very different goals and require different processes, even if the underlying mathematical model, regression, is the same.

In econometric modelling, the separation of model development and estimation was maintained to ensure that the statistical properties of econometric models would hold. If data were used to develop the models, the models would start to suffer from overfitting, which would risk invalidating statistical tests. Econometric models can maintain this separation, at least in principle, because theory provides an additional source of information with which the models can be built.

The black box methods, which do not rely on theory, use the information contained in the data to both build and estimate the models. Since these models are not in general used for statistical testing of causal relationships, the risk of this approach is much smaller. The black box approach also emphasizes testing models on held-out data, as well as specific techniques that attempt to reduce the risk of overfitting. Having presented these methods as though they were distinct, it is time for us to admit that, as we acknowledged earlier, the difference may be more apparent than real – the two approaches may be better thought of as being at two ends of a spectrum. Both approaches use theory to inform their models, both approaches use data to refine their models, and both processes recognize the value of maintaining different data sets for development, estimation, and testing.

In the econometric literature, the drift away from a pure division of model development and estimation has given rise to three philosophical approaches to model development. They are well discussed in Kennedy (one of our favorite books – we own about five copies between the two of us!). We provide a quick overview of these approaches, and then some practical advice.

The three philosophical approaches, as described in Kennedy, are the Average Econometric Regression, Test Test Test, and Fragility Analysis.

11.1 Average Econometric Regression (AER)

This approach takes a strong view on the role of theory in generating the initial statistical model. In AER it is generally presumed that the data generating process (DGP) in the real world is a special case of the one expressed by the model. The role of the data is to provide the best possible estimate of the relevant parameters. This is as close to the theoretic approach as one is likely to get.

In the AER approach, a simple model to explain sales might be used:

\[Sales = \beta_0 + \beta_1 Income + Error\]

In the AER view, if a model does not appear to fit, that suggests problems related to the data and regression process, not the model itself. The response then is to use increasingly sophisticated techniques, such as corrections for heteroscedasticity, autocorrelation, recognizing the potential for structural change, etc. to find the estimated values.
Continuing the example above, if the true process were:

\[Sales = \beta_0 + \beta_1 Income + \beta_M Male * Income + Error \]

The simple model would appear to have heteroscedasticity because the effective error term would be:

\[Simple\; Error = \beta_M Male * Income + Error\]

Testing for heteroscedasticity would almost certainly produce a strong signal that heteroscedasticity exists (when it doesn’t!)
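This is easy to demonstrate by simulation. The setup below is entirely our own construction: we generate a homoscedastic DGP that contains a Male × Income interaction, fit the simple model that omits it, and run statsmodels’ Breusch-Pagan test on the residuals:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
n = 500
income = rng.uniform(20, 120, n)
male = rng.integers(0, 2, n)
# True DGP: homoscedastic error, but with a Male * Income interaction
sales = 10 + 2.0 * income + 1.5 * male * income + rng.normal(0, 5, n)
df = pd.DataFrame({"Sales": sales, "Income": income, "Male": male})

# Fit the simple model that omits the interaction
simple = ols("Sales ~ Income", data=df).fit()

# Breusch-Pagan flags 'heteroscedasticity' in the simple model's residuals,
# even though the true error term is homoscedastic
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(simple.resid,
                                                        simple.model.exog)
print(lm_pvalue)  # far below 0.05
```

The test is behaving correctly: the simple model’s effective error really does vary with Income. The mistake would be concluding that the DGP’s error term is heteroscedastic rather than that a variable is missing.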

If problems persist after the estimation methods have been ‘corrected’, only then are model revisions undertaken. Since models are ultimately subjected to data-driven modifications, the AER approach does have some data driven aspects of modelling.

11.2 Test Test Test

As the name suggests, this approach focuses on testing. Here we are moving a bit closer to a black box – or perhaps a glass box – approach. Different modelling specifications are developed, generally informed by theory, and the data, along with numerous rounds of testing, are used to determine which model fits best.

For example, we proposed two models for the decreasing returns to scale in advertising. There are probably a dozen more models that could reasonably be used. In the TTT approach, each model could be compared to see which fits the data best. This makes sense at least insofar as our theory suggests that the impact of advertising budgets should decrease, but it did not say how.

A more complex theory, say of advertising budgets, or one based on experience with similar products, may narrow the field of potential models, but the TTT approach generally wants data to play a role in selecting the model.

Practitioners of the TTT philosophy have plenty of tools at their disposal: two or more models could be nested inside a single, more complex one and formally tested; individual models can be subject to specification tests for omitted variables; various candidate variables and expressions of those variables could be tried. Tests for data problems like heteroscedasticity and others as discussed in AER are also available, but they are used as a second, not first resort.

11.3 Fragility Analysis (FA)

Fragility analysis is a promising but significantly different approach that echoes our discussion of triangulation in modelling found in our earlier text. The idea of fragility analysis is that, beyond prediction, we only care about the causal impact of a few variables – those under our control. So given that we do not know the true model, we should try many specifications and see how the apparent impact of certain key variables changes with the specification.

If the majority of (or perhaps all) specifications lead to similar estimates, we have achieved a form of triangulation and can be (more) confident in the estimated causal impact. If the results differ significantly across the various specifications, we would have to be awfully confident in which specification applies before attempting to estimate the causal effect.
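One way to sketch fragility analysis in code (the data, variable names, and specification list below are all our own invention; the idea is simply to track the coefficient on the variable of interest across several plausible models):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "Price": rng.uniform(5, 15, n),
    "Income": rng.normal(60, 10, n),
    "Adverts": rng.uniform(0, 100, n),
})
# Simulated DGP: the causal effect of Price on Sales is -3
df["Sales"] = (200 - 3.0 * df["Price"] + 0.5 * df["Income"]
               + 0.2 * df["Adverts"] + rng.normal(0, 5, n))

# Try several plausible specifications and track the Price coefficient
specs = [
    "Sales ~ Price",
    "Sales ~ Price + Income",
    "Sales ~ Price + Adverts",
    "Sales ~ Price + Income + Adverts",
]
for frmla in specs:
    fit = ols(frmla, data=df).fit()
    print(f"{frmla:35s} Price coef: {fit.params['Price']:.2f}")
```

Because the simulated regressors are independent, the Price coefficient stays close to -3 in every specification – triangulation succeeds. If the estimates swung wildly across specifications, fragility analysis would counsel caution about any causal claim.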

11.3.1 So how do you get the best model?

It should be clear from the preceding discussion that there is no guarantee here. That said, there are some practices that are worth embracing:

  1. Where possible, use theory to inform your model development.
  2. Maintain multiple data sets for developing, estimating, and testing.
  3. Where theory is weak, or nonspecific, use data to inform the model specification. The approach here should embrace the humility of the TTT approach rather than the more arrogant AER approach.
  4. When testing models, data permitting, over-specify them, which is to say, put in more variables than you need and test down rather than testing up.
  5. Resolve issues arising from multiple potential specifications with multivariate tests (see below.)

For important decisions involving estimated causal impacts, triangulating the impact across multiple potential models in the FA style is a safe bet.

Chapter 12 Feature Engineering

A feature is any variable that is important in your model. Feature engineering is the process through which you create new variables in your dataset to help make your model more accurate, more useful, and simpler. It is an important consideration in model building, and becomes increasingly important as you move into more advanced machine learning models. From an econometrics perspective, you must make sure that you have captured the underlying process in your model. From a machine learning perspective, any extra features you can add may allow for more accurate predictions.

Feature engineering is an iterative process and should begin with theory. We recommend you brainstorm what features might be useful, create or acquire them, test them in your model, and potentially go back to the drawing board to decide what other features you might need.

There are five different types of features, but all should be driven by theory. Feature engineering is an art, and every feature you create should aim to solve your modelling problem. You should select which features to create and include in your model based on their importance. Importance can be judged from a technical perspective (i.e., correlation) or from a theory perspective (i.e., you know that price has an impact on sales quantity).

12.1 Indicators

We create indicator variables to represent special groups, classes, or periods of time. We typically represent these as dummy or binary variables (i.e., a 1 when the observation belongs to a group, and a 0 when the observation does not). We will discuss dummy variables, and their impact on models in the next chapter.

Indicators are incredibly useful and easy to create. Some examples of useful indicators include marking groups such as United States vs. Canada, or representing special times of year (for example, Black Friday or Christmas). There are a number of ways to create indicators, but the easiest is with a ‘where’ statement.

Let’s use the Iris dataframe for an example. We will create an indicator variable for below-average sepal length. This is just one example of an indicator, and is meant to demonstrate the coding technique rather than the theory.

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = load_iris().data
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
dataset = pd.DataFrame(data,columns=names)

avg = np.mean(dataset['sepal-length'])

dataset['small_sepal'] = np.where(dataset['sepal-length'] < avg, 1, 0)

dataset.head(10)
##    sepal-length  sepal-width     ...       petal-width  small_sepal
## 0           5.1          3.5     ...               0.2            1
## 1           4.9          3.0     ...               0.2            1
## 2           4.7          3.2     ...               0.2            1
## 3           4.6          3.1     ...               0.2            1
## 4           5.0          3.6     ...               0.2            1
## 5           5.4          3.9     ...               0.4            1
## 6           4.6          3.4     ...               0.3            1
## 7           5.0          3.4     ...               0.2            1
## 8           4.4          2.9     ...               0.2            1
## 9           4.9          3.1     ...               0.1            1
## 
## [10 rows x 5 columns]

12.2 Interactions

Interactions are combinations of variables in your dataset. In their simplest form, this could mean computing profit by subtracting costs from revenues. In more complicated forms, you could combine collinear variables to capture information from both without causing issues (for example, the Body Mass Index is a combination of weight and height – two collinear variables).

You can also use interaction terms to capture differential impacts of variables on different groups. We will discuss this in more detail next chapter, but the combination of an indicator and another independent variable creates a powerful interaction term.
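A brief sketch with made-up data (the column names are ours): multiplying an indicator by another variable creates the interaction term directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Income": rng.normal(60, 10, 5).round(1),
    "Male": [1, 0, 1, 0, 1],
})

# The interaction: income, but only for the male group (0 otherwise)
df["Male_x_Income"] = df["Male"] * df["Income"]
print(df)
```

In statsmodels formulas you can let the software build the same term for you: `'Sales ~ Income + Male + Male:Income'` includes the interaction without creating the column by hand.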

12.3 Representations

Feature representations are other transformations to variables to make them more useful, including combining sparse groups, making variables more detailed, or creating categorical mappings. Often, these features are used on data that are difficult to model with. For example, time of day is challenging to use in a model, but could be broken into representations for morning, afternoon, and evening. Most commonly, we combine categorical groups with very low sample sizes into an ‘other’ category.
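For instance, the time-of-day representation might look like this in pandas (the breakpoints between morning, afternoon, and evening are our assumption):

```python
import pandas as pd

hours = pd.Series([7, 11, 14, 19, 23])

# Map the raw hour-of-day onto three coarser categories
time_of_day = pd.cut(hours, bins=[0, 12, 18, 24],
                     labels=["morning", "afternoon", "evening"],
                     right=False)
print(time_of_day.tolist())
# ['morning', 'morning', 'afternoon', 'evening', 'evening']
```

The same `pd.cut` pattern works for combining sparse numeric ranges, and a plain dictionary with `Series.map` handles categorical mappings such as collapsing rare levels into ‘other’.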

12.4 External Data Joins

The most flexible form of feature engineering is joining external data. In its simplest form, this could be joining two tables together to add more data. Statistics Canada, the US Census Bureau, and Environics data are all easy ways to add extra features to your model. However, there are many other options. Using APIs and other models, we can add extra information to our dataset. For example, Google Maps has an API that will translate coordinates into addresses, postal codes, and cities. Natural Language Processing algorithms can add sentiment scores to your text data. The options here are endless.

12.5 Scaling Data

Sometimes, you will need to scale or transform your data to improve your model. Common transformations, such as taking the natural log or squaring a variable, allow you to capture additional patterns in your model. As well, you can scale data if you have variables with different magnitudes that you need to compare directly.

If you identify specific patterns in your residual pattern, a common technique is to use a data transformation that matches that pattern (e.g., try squaring a variable if your residuals form a parabola.)
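A minimal sketch of both ideas, a log transform and standardization (the income values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income": [40000.0, 55000.0, 90000.0, 250000.0]})

# The log transform compresses the long right tail of income
df["Log_Income"] = np.log(df["Income"])

# Standardization puts variables with different magnitudes on one scale
# (mean 0, standard deviation 1)
df["Income_z"] = (df["Income"] - df["Income"].mean()) / df["Income"].std()

print(df.round(2))
```

After standardization, coefficients on different variables can be compared directly because each is expressed in standard-deviation units.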

Chapter 13 Dummy Variables

Dummy variables are used in a wide set of important techniques in regression modelling. Since we have already encountered dummy variables in both this and our previous text, the current discussion will be concise.

13.1 Coding Dummy Variables

Dummy variables are derived variables that take on a value of either 0 or 1 to indicate category membership. In this way, dummy variables are a subset of categorical variables. For example, above we defined the dummy variable Male such that:

\[Male = 1\; if\; respondent\; is\; male,\; 0\; otherwise\]

By this definition, every respondent is coded with a 0 or 1 in this field. Of course, categorical variables can have a wider range of values, but dummy variables cannot. You may recall from our previous text that we could construct a categorical variable, Marital_Status whose values would be:

\[Marital\_Status = 1\; if\; single,\; 2\; if\; married,\; 3\; if\; divorced,\; 4\; if\; other\]

Since a dummy variable indicates membership in a category with a 1 and non-membership with a 0, the Marital_Status variable could be recoded into four distinct dummy variables:

\[Single = 1\; if\; single,\; 0\; otherwise;\; Married = 1\; if\; married,\; 0\; otherwise,\; etc.\]

In this way, every respondent is coded with a 1 in exactly one of these fields. So, those dummy variables form a collection that effectively spans, or makes up, the set of possible marital statuses as defined in the data.

At the time, we said that mathematical operations do not generally apply to data coded with categorical variables. The exception we noted was that dummy variables can be averaged to provide probabilities or proportions, and they can be used in conjunction with multiplication to build linear models. It is the latter that allows dummy variables to extend the power of linear regression to include multiple curves, structural change, and a variety of interaction effects.
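Both points are easy to demonstrate in pandas: `get_dummies` recodes a categorical field into the full collection of dummies, and averaging a dummy gives a proportion.

```python
import pandas as pd

status = pd.Series(["single", "married", "divorced", "other", "married"],
                   name="Marital_Status")

# One dummy per category (columns come out in sorted order)
dummies = pd.get_dummies(status)
print(list(dummies.columns))  # ['divorced', 'married', 'other', 'single']

# Averaging a dummy gives the proportion in that category
print(dummies["married"].mean())  # 0.4

# For regression, drop_first=True omits one category as the base case
print(list(pd.get_dummies(status, drop_first=True).columns))
```

The `drop_first=True` option anticipates the next section: exactly one category must be omitted when the dummies go into a regression.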

In Chapter 9, we used dummy variables to allow male customers to be different in either the level of spending on food at a restaurant or the rate at which income relates to spending on food. We also showed how dummy variables can be used to model structural change. There are two additional effects we would like to demonstrate with dummy variables.

13.2 Multiple Categories with Dummy Variables

When we first modelled restaurant spending in Chapter 9 we included the possibility that male customers spend a different amount on food than other customers. We did this by including a single category of dummy variable. Even when we knew that the data only contained male and female respondents, we still only included one category. The omitted category was captured as the base case – and the effect of being male, if any, was captured as a deviation from the base case.

This is a standard approach in modelling dummy variables: we choose a base case whose dummy variable, if we even bothered to calculate it, would be omitted. We then measure the impact of membership in the other categories by comparison with that category.

For example, in the marital status example, a collection of four dummy variables spanned the marital status category as defined in the data set. If we wanted to model restaurant spending of a single customer using that individual’s marital status as a predictor, we could build a model:

\[Sales = \beta_0 + \beta_M Married + \beta_D Divorced + \beta_R Other + \beta_1 Income + Error\]


With the dummy variables Married, Divorced, and Other defined as above and Single being the omitted category. In this way, \(\beta_M\) would be the increase in sales from an individual being married RELATIVE to an otherwise identical individual who was single. If \(\beta_M\) were a negative number, the ‘increase’ would be negative. Similarly, the impact of being divorced or in the category Other, relative to being single, is captured by \(\beta_D\) and \(\beta_R\) respectively.

If we wanted to know something about the difference between married and divorced people, we would calculate the Effect of Married vs. Divorced = (base effect + married effect) - (base effect + divorced effect) = \((\beta_0 + \beta_M) - (\beta_0 + \beta_D) = \beta_M - \beta_D\)

And, we can use similar calculations for any other dummy variable effect.
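In statsmodels, the \(\beta_M - \beta_D\) comparison can be tested directly with `t_test`. The data below are simulated by us purely to make the example runnable (the true married and divorced effects are 8 and 3):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({
    "Married": rng.integers(0, 2, n),
    "Income": rng.normal(60, 10, n),
})
# Divorced only among the non-married (categories are mutually exclusive)
df["Divorced"] = np.where(df["Married"] == 0, rng.integers(0, 2, n), 0)
df["Sales"] = (20 + 8 * df["Married"] + 3 * df["Divorced"]
               + 0.4 * df["Income"] + rng.normal(0, 4, n))

fit = ols("Sales ~ Married + Divorced + Income", data=df).fit()

# Test H0: beta_M - beta_D = 0 directly
print(fit.t_test("Married - Divorced = 0"))
```

The estimated difference comes out near the true value of 5, and the test rejects the null that married and divorced customers spend the same.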

The only real trick with dummy variables is to make sure that you always omit exactly one category. If you do not omit any categories, the model will have perfect collinearity, which violates the mathematical requirements for OLS, and will likely result in one variable being dropped at random or the model not running at all. If you omit more than one category, the results will be difficult to interpret because the base case will be some kind of average of the omitted categories. The logic we have developed here for mean shift effects also works for using dummy variables to change slopes.

13.2.1 Simple Seasonal Effects with Dummy Variables

We can also use dummy variables to capture seasonal effects. This is done by simply defining a variable to capture seasonal membership for each pattern of seasonality we are interested in. Seasonal effects relate to peaks or troughs in sales or other dependent variables. They may occur at regular calendar intervals, like Christmas, aperiodically like Passover or Easter, or drift like Ramadan. They can occur on an annual, monthly, weekly, or other cycle.

For example, we might suspect that sales of flowers go up on Valentine’s Day and Mother’s Day. If we believed that the impact was about the same, we could define a single variable: Peak_Season = 1 if the week contains Valentine’s Day or Mother’s Day, 0 otherwise. If we thought the two days had different peak effects, we could simply define two variables: Valentines_Day = 1 if the week contains Valentine’s Day, 0 otherwise; Mothers_Day = 1 if the week contains Mother’s Day, 0 otherwise.

As with other dummy variable models, seasonality models must have an omitted category against which the peak or trough effects are measured. This is normally the ‘regular’ period of time.
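Concretely, with weekly data the dummy might be built like this (the column names and the 2019 dates are our own example; a real project would derive the holiday weeks from a calendar):

```python
import numpy as np
import pandas as pd

# Eight weeks of (hypothetical) weekly data starting Monday, Jan 7, 2019
weeks = pd.DataFrame({
    "week_start": pd.date_range("2019-01-07", periods=8, freq="7D")
})

# Flag the week containing Valentine's Day (Feb 14)
feb14 = pd.Timestamp("2019-02-14")
weeks["Valentines_Day"] = np.where(
    (weeks["week_start"] <= feb14)
    & (feb14 < weeks["week_start"] + pd.Timedelta(days=7)),
    1, 0)
print(weeks)
```

Only the week starting February 11 gets a 1; every other week is part of the omitted ‘regular’ category.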

Chapter 14 Testing

You have encountered hypothesis testing as a major section of our previous text. At that time, the primary use of hypothesis testing was to conduct tests that aligned with proposed business actions, to determine if the data supported undertaking that action. At that time, we promoted a test-developing method starting with the alternative hypothesis, framing it in the form:

\[H_1:\; <We\; will\; undertake\; action...>\; if\; <the\; evidence\; shows...>\]

While testing continues to support questions of a direct business interest, testing in a regression model extends to tests that are designed to help with model development. Tests developed for this purpose have no obvious business interpretation as they are focused on diagnosing potential problems with models and data and implementing corrective action.

The tests you have encountered so far in our texts have been based on the normal or t-distributions. As we progress in linear regression and related topics, we will encounter tests that are based on a much wider range of probability distributions. Some of these distributions appear in many testing contexts, such as F and Chi-squared; others appear only in special situations, such as the tau or Durbin-Watson test statistics used in time series analysis.

14.1 Testing - What Remains the Same?

Testing in this expanded context will still be based on a null hypothesis, which is assumed to be true for the purposes of the test. This null hypothesis, along with a model, will be used to derive a distribution of the test statistic under the assumption that the null hypothesis is true. A specific value of the test statistic will be calculated from the model and data. That test statistic will be used to determine a p-value, which is to say, the probability of getting a result ‘as extreme as the one we just calculated’.

That p-value will be compared to a pre-determined threshold, alpha. If the p-value is smaller than alpha, the null will be rejected; otherwise the null will not be rejected.

14.2 Testing - What is Different?

In this context, we will often be using distributions other than the normal and t. Generally, you will not need to know what these distributions look like, and you will certainly not have to derive any of them. Whatever software you use will generally take care of the mechanics of calculating the test statistics and reporting their p-values. Beyond that, our study of the normal and t-distributions will give you a sense of what is going on inside the computer, at least in principle.

There are going to be many more tests available to you – too many to know in general. You will need to be familiar with a range of generic tests, including the joint-F test, the t-test on individual beta terms, and arbitrary F-tests of joint restrictions. You will need to be familiar with special purpose tests such as tests for heteroscedasticity and the Chow test for structural change. Beyond these tests, there are hundreds of others that depend on the types of analysis you end up doing.

14.3 Assessing vs Testing

Beyond formal testing, model development includes a step we frequently refer to as ‘assessing’. Assessing is to testing what a chart is to a summary statistic.

A summary statistic, such as correlation, can tell you some precise information about a very specific question, but it may fail to provide useful information about a related topic. A chart, on the other hand, does not provide as specific a set of information, but it does provide a general perspective that might highlight other questions.

In the same way, an assessment process might involve looking at any number of charts displaying regression output for patterns that might suggest problems. Once a potential problem is detected, the next step is to use one or more formal tests or adjustments to the model to confirm and / or correct the problem. The formal tests are required because people are so good at finding patterns that they will frequently find them where they do not exist; but the informal assessment step is required because there are so many tests that could be applied that some filtration process is necessary.

Our general advice is to produce and review the graphical output for important analysis work. Hopefully, it will reveal nothing of interest but it may reveal an insight you had missed or a problem that, if uncorrected but caught by your client, would cost you your job. Generally speaking, it is a good starting point to think that any unexpected pattern in the data is a problem that you should look into.

14.4 Specific Testing

For the balance of the chapter we will discuss the tests mentioned above. Remember that this is only an introduction to the available tests – there are hundreds of them.

14.4.1 The t-tests for Parameter Significance

At this point in your career you have seen t-tests applied to single- and two-variable problems. The basic logic of the t-test arose when considering a null hypothesis that the population (or DGP) mean, mu, was a particular value. In order to assess this, a sample was drawn and an average was calculated.

The Central Limit Theorem (CLT) assured us that, if the sample was large enough or if the original distribution was normal, the sample mean would have a known sampling distribution under the null hypothesis. We would estimate the standard deviation and use it to calculate a t-statistic, which follows a t-distribution and can be used to calculate the p-value – the probability of obtaining a result as extreme as the one we found.
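The mechanics described above can be sketched in a few lines. The sample below is simulated purely for illustration; the hand-rolled t-statistic is then checked against scipy's one-sample t-test.

```python
import numpy as np
from scipy import stats

# Simulated sample (illustrative numbers only)
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.2, scale=2.0, size=50)

# H0: mu = 5 vs H1: mu != 5, done by hand
t_stat = (sample.mean() - 5) / (sample.std(ddof=1) / np.sqrt(len(sample)))
p_value = 2 * stats.t.sf(abs(t_stat), df=len(sample) - 1)

# scipy's one-sample t-test performs the same calculation
t_check, p_check = stats.ttest_1samp(sample, 5)
print(t_stat, p_value)
```

The two-sided p-value is twice the tail area beyond the observed t-statistic, which is exactly what `stats.ttest_1samp` reports.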

It turns out that, because regression is simply a conditional averaging process, under reasonable assumptions about sample size or the normality of the error term, the magic of the CLT applies here too: each of the estimated beta terms, from \(0\) to \(k\), has a normal sampling distribution and produces a t-distribution when its standard deviation is estimated. This means that it is relatively easy to test the null hypothesis:

\[H_0:\; \beta_i = 0\; vs.\; H_1:\; \beta_i \neq 0\]

for every beta term from \(0\) to \(k\). This result is produced as standard output by virtually every regression package on the market today.

But why is this important? Suppose we estimated the model

\[Sales = \beta_0 + \beta_1 Price + \beta_2 Ad\_Budget\]

and found that the \(\beta_2\) term was not significantly different from 0. This would mean that, as we increased Ad_Budget, there was no verifiable change in sales. If this result held up, it would mean that your advertising budget did not impact sales and did not improve the predictions of sales. That might suggest that advertising budget should be removed from the model.

This is where the different philosophies about model development kick in. In the AER perspective, one might focus on diagnosing problems with the data and regression approach. Perhaps the failure to reject is a result of heteroscedasticity. From the TTT perspective, one might attempt different specifications of the model – perhaps the model suffers from omitted variables, perhaps there is a decreasing return to scale in advertising… but how could you possibly know?

This brings up an important aspect of model testing. To test a relationship, one needs to build a model that allows the data to express that relationship and then test to see if it does. So perhaps the model needs an additional term to capture the decreasing returns to scale – say Ad_Budget². The new model would look like this:

\[Sales = \beta_0 + \beta_1 Price + \beta_2 Ad\_Budget + \beta_3 Ad\_Budget^2\]

We might now find that we could reject the null hypothesis \(H_0:\; \beta_2 = 0\; vs.\; H_1:\; \beta_2 \neq 0\), because \(\beta_2\) is no longer obliged to carry both the positive effect of Ad_Budget and the presumably negative effect of Ad_Budget².

We could also test whether Ad_Budget² belongs in the model by testing: \(H_0:\; \beta_3 = 0\; vs.\; H_1:\; \beta_3 \neq 0\)
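In statsmodels' formula interface, the squared term can be added directly with `I(...)`, and each beta term's t-test p-value comes back from the fitted model. The data below are simulated stand-ins for the Sales example; all coefficients and sample sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated stand-in for the Sales example (all numbers are made up)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({'Price': rng.uniform(5, 15, n),
                   'Ad_Budget': rng.uniform(0, 100, n)})
df['Sales'] = (500 - 20 * df['Price'] + 8 * df['Ad_Budget']
               - 0.05 * df['Ad_Budget'] ** 2 + rng.normal(0, 30, n))

# I(...) lets the formula include the squared term directly
model = ols('Sales ~ Price + Ad_Budget + I(Ad_Budget ** 2)', df).fit()
print(model.pvalues)  # one t-test p-value per beta term
```

The `pvalues` attribute holds the p-value of the t-test \(H_0:\; \beta_i = 0\) for the intercept and each explanatory term, including the squared one.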

There is one small problem here. Ad_Budget and Ad_Budget² are positively correlated, so depending on the distribution and sample size, we may have an issue of collinearity – which could make distinguishing the effects of the two difficult. We will return to this issue later. For now, we will note that we may have solved the problem if we have enough sample size and / or good enough data.

Before we move on, we will note one final thing. Looking at the model, we could presumably test \(H_0:\; \beta_0 = 0\; vs.\; H_1:\; \beta_0 \neq 0\). But what would it mean if we rejected the null? Technically, it would mean that the intercept is different from 0 – but do we care about that? It might be true, but even if it were, there really isn’t an explanatory variable to remove from the model. So the \(\beta_0\) term is somehow different from the others. It also turns out that if \(\beta_0\) were removed from the model we would lose some mathematical properties of the model, so a) don’t bother testing it and b) if you do, don’t remove it.

14.5 The Joint-F Test

Given that we just tested the impact of one of the variables, what if we did it with all of them? What if we took our original model

\[Sales = \beta_0 + \beta_1 Price + \beta_2 Ad\_Budget\]

and tested each of the \(\beta_1\) and \(\beta_2\) terms. If both t-tests failed to reject the null, wouldn’t that mean that the model was garbage – that none of the explanatory variables mattered? The answer to that is a firm yes and no.

The yes part is that if we cannot reject the null that they are each equal to 0, the model is garbage. The no part is because the t-test is not the right test for the job. What we really want to test is: \[H_0: \beta_1 = \beta_2 = 0\; vs.\; H_1:\; At\; least\; one\; of\; \beta_1\;or\;\beta_2 \neq 0\]

For this, we need an F-test. The F-test, like the t-test, is a general-purpose test and, like the t-test, it is based on its own distribution (uninspiringly named the F-distribution). The benefit of the F-test is that it can simultaneously test any number of the beta parameters, so we can test all of the parameters of interest at once. As we mentioned above, we are seldom interested in the \(\beta_0\) term’s significance because it does not link to an explanatory variable, so we should not include it in a test of whether the model can actually explain anything.

It turns out that this overall test is so important that it too is generated by virtually all regression software as a standard option, with the p-value and test statistic reported as standard output. Specifically, the joint-F test tests:

\[H_0:\; \beta_1 = \beta_2 = ... = \beta_K = 0\; vs.\; H_1:\; At\; least\; one\; of\; \beta_1, \beta_2, ..., \beta_K \neq 0\]

There are some interesting things that are worth noting about this test.

1. It is possible for the joint-F test to reject the null when none of the individual beta terms are significant in their own t-tests.

This result may be caused by collinearity. The individual variables, like height and weight, may be too similar for the regression model to distinguish the effects, but jointly at least one of them is important. In this case, the t-tests may not reject but the F-test will.

2. It is possible for the joint-F to fail to reject the null when one or more of the individual t-tests are significant.

This result occurs because each time you test whether one of the beta terms = 0, you have a chance of a type 1 error – that is, of falsely rejecting the null hypothesis. This error compounds over multiple tests, so if you test k individual beta terms, you are much more likely to make a type 1 error than if you perform one F-test.

3. If you cannot reject the null in the joint-F test, you should not use the model for prediction or estimation, but you can still do some diagnostics on it to find out why it failed. Perhaps you will find that there were outliers, corrupt data, missing observations, etc. that caused the problem.
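The compounding of type 1 errors mentioned in point 2 is simple arithmetic: with k independent tests, each at significance level alpha, the chance of at least one false rejection is 1 − (1 − alpha)^k.

```python
# Probability of at least one false rejection across k independent tests
alpha = 0.05
for k in (1, 5, 10):
    familywise = 1 - (1 - alpha) ** k
    print(k, round(familywise, 3))  # 1 -> 0.05, 5 -> 0.226, 10 -> 0.401
```

Even ten individual t-tests at the 5% level carry roughly a 40% chance of at least one false rejection, which is why one joint F-test is the right tool here.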

14.6 The Tests in Python

We have discussed in detail the assumptions one makes when building an econometric model, but have not provided any details on how you should actually do this. Fortunately, any decent statistical software has convenient methods of assessing and testing your models. Let’s use the regression we built above as an example.

Using the statsmodels package we can look into some of our regression diagnostics easily. We will also demonstrate how to produce some of these results within scikit-learn, but it is more involved.

from sklearn.linear_model import LinearRegression
import numpy as np

train_X = grocery_data['Family_Income'].values
train_y = grocery_data['Grocery_Bill'].values

train_X= train_X.reshape(-1, 1) #you only need to reshape the array for Simple Regression
train_y= train_y.reshape(-1, 1) #you only need to reshape the array for Simple Regression 

reg = LinearRegression().fit(train_X, train_y)

print("B_0 =",reg.intercept_)
## B_0 = [-22.65328321]
print("B_1 =",reg.coef_[0])
## B_1 = [0.00257742]
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('Grocery_Bill ~ Family_Income',grocery_data).fit()
print(model.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:           Grocery_Bill   R-squared:                       0.635
## Model:                            OLS   Adj. R-squared:                  0.634
## Method:                 Least Squares   F-statistic:                     1733.
## Date:                Tue, 26 May 2020   Prob (F-statistic):          2.13e-220
## Time:                        09:21:13   Log-Likelihood:                -5310.8
## No. Observations:                1000   AIC:                         1.063e+04
## Df Residuals:                     998   BIC:                         1.064e+04
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## =================================================================================
##                     coef    std err          t      P>|t|      [0.025      0.975]
## ---------------------------------------------------------------------------------
## Intercept       -22.6533      6.631     -3.416      0.001     -35.665      -9.641
## Family_Income     0.0026   6.19e-05     41.632      0.000       0.002       0.003
## ==============================================================================
## Omnibus:                       33.493   Durbin-Watson:                   2.088
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.522
## Skew:                           0.295   Prob(JB):                     6.49e-12
## Kurtosis:                       3.943   Cond. No.                     4.58e+05
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.58e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.

Here we see the following results:

F-Test

The p-value of the F-statistic corresponds to a test that all the coefficients are simultaneously equal to 0. We are looking for a rejection of this test as evidence that the model does in fact explain something. If we fail to reject the null, we have no evidence that any of the coefficients differ from 0.

t-Tests

Each variable has its own t-test. The p-values are output in the column named P>|t|. Here, the null hypothesis is that the individual variable’s coefficient is equal to 0. If we are able to reject the null of this test, it suggests that the variable does belong in the model.

R2

The summary function produces both the multiple and adjusted R2 values. R2 measures the proportion of variation in your dependent variable that the model explains. Multiple R2 always increases with the number of variables, even those that are not significant. Adjusted R2 includes a penalty factor for each additional variable, trading off extra explanatory power against a more complicated model.
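The penalty in adjusted R2 is a simple formula; this sketch reproduces the adjusted value from the summary output above (n = 1000 observations, k = 1 explanatory variable).

```python
# Adjusted R2 applies a penalty based on sample size n and variable count k
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Reproduces the adjusted value in the summary above (n = 1000, k = 1)
print(round(adjusted_r2(0.6345909069158749, 1000, 1), 3))  # -> 0.634
```

With only one explanatory variable and a large sample, the penalty is tiny; it grows as you add variables relative to your sample size.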

You can get R2 directly from your sci-kit learn model with the following code:

print("R-Squared:", reg.score(train_X, train_y))
## R-Squared: 0.6345909069158749

Last, the coef column lists the coefficients we have been discussing in depth. Do not worry about the rest of the output; it is not as important from a practical standpoint.

14.7 The Plots in Python

You may have noticed we have not yet given you the tools to assess error normality or find heteroscedasticity. You will need to produce a few different plots here.

Residual vs. Fitted

The data plotted here should have no pattern (e.g., parabola, two clusters, cone shaped) and be evenly distributed around 0 on the y-axis.

predicted_y = reg.predict(train_X)
residuals = train_y - predicted_y

import matplotlib.pyplot as plt
plt.scatter(predicted_y,residuals,s=2,c='black')

plt.hlines(0,min(predicted_y),max(predicted_y),color='red',linestyles='dashed')

plt.xlabel("Model Prediction")
plt.ylabel("Residual")
plt.show()

Normal Q-Q

The Q-Q plot gives insight into the normality of the residuals. The plot should look similar to the one below – the further the points are from the diagonal line, the more likely there are normality problems.

import scipy.stats as stats
residuals = np.reshape(residuals, -1) # you only need to reshape the array for Simple Regression
fig = plt.figure()
ax = fig.add_subplot(111)
stats.probplot(residuals,dist='norm',plot=ax)
plt.show()

Scale-Location

Use this plot to check the assumption of equal variance in errors (homoscedasticity). You want to see a horizontal line with points spread randomly on either side. If there is a clear cone-shaped pattern, it is evidence of heteroscedasticity.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(residuals.reshape(-1,1))
norm_residuals = scaler.transform(residuals.reshape(-1,1))

plt.scatter(predicted_y,np.sqrt(np.abs(norm_residuals)),c='black',s=2)
plt.xlabel("Fitted Values")
plt.ylabel("Root of standardized residual")
plt.show()

Residuals vs. Leverage

The Cook’s Distance plot helps you identify observations with high leverage (i.e., those that may greatly influence the regression results). In this plot, you cannot see the dotted red lines – this is a good thing. Observations outside those lines have high leverage and are worth investigating.

fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.influence_plot(model, ax=ax, criterion="cooks")
plt.xlim([0, 0.004])
plt.show()

14.7.1 Density Plot

The density plot maps the residual values against a normal distribution. In this plot, you are looking for as normal a distribution as possible.

from scipy.stats import norm

mean, std = norm.fit(residuals)

plt.hist(residuals, bins=13, edgecolor='black', density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mean, std)
plt.plot(x, p, color='black')
title = "Fit results: mu = %.2f,  std = %.2f" % (mean, std)
plt.title(title)

plt.show()

14.8 Specific F-tests

Given that you can test all of the parameters with an F-test, it should not surprise you that you can test a subset of them. This can be very helpful when you are dealing with collinearity. When two or more variables are collinear, the individual estimates are sloppy, but the joint effect can be estimated quite accurately. So if you have a model like the modified one above:

\[Sales = \beta_0 + \beta_1 Price + \beta_2 Ad\_Budget + \beta_3 Ad\_Budget^2\]

And you cannot distinguish between the effect of Ad_Budget and Ad_Budget2 because of collinearity, you might want to run the test: \[H_0:\; \beta_2 = \beta_3 = 0\; vs.\; H_1:\; At\; least\; one\; \beta_2\; or\; \beta_3 \neq 0\].

Rejecting this joint null would tell you that at least one of the variables belonged, and therefore Ad_Budget was contributing to your model. It wouldn’t tell you which one belonged, but it would provide clear justification for including them in the model.

For what it is worth, there really isn’t a test for collinearity. We (well, at least one of us) would go so far as to say that if people knew more about using F-tests, they would worry less about collinearity.

Hypothesis tests of this form can be run using the statsmodels.regression.linear_model.RegressionResults.wald_test method.

hypothesis = '(Family_Income=0)'
print(model.wald_test(hypothesis))
## <F test: F=array([[1733.18545457]]), p=2.1264991958242397e-220, df_denom=998, df_num=1>

14.9 The Chow Test for Structural Change

Given that we can use an F-test to test for arbitrary collections of parameters, we can create certain collections of parameters to test that will give us additional insight. One common technique is to look for structural change. In order to do this, as with all tests, we need to develop a model that can express structural change and then see if it exists.

We might look for structural change at a point in time, or between two (or more) subsets of a population. The process we would be using is called a Chow Test. Suppose that our model above:

\[Sales = \beta_0 + \beta_1 Price + \beta_2 Ad\_Budget\]

was applied in Ontario and Quebec, possibly with different prices and advertising budgets for different locations in each province. We might wonder if sales in the two provinces behaved the same. Building on the general claim that testing requires that we build a model that allows the data to express a difference and then test to see whether it does, we could define a dummy variable Quebec = 1 if the data is from Quebec, 0 otherwise. We would then adjust the model to express the difference between Quebec and Ontario customers:

\[Sales = \beta_0 + \beta_1 Price + \beta_2 Ad\_Budget + \beta_{Q1} Price * Quebec + \beta_{Q2} Ad\_Budget * Quebec\]

We could then run the test:

\[H_0:\; \beta_{Q1} = \beta_{Q2} = 0;\; H_1:\; At\; least\; one\; of\; \beta_{Q1}\; or\; \beta_{Q2} \neq 0\]

If we reject the null, then at least one of the variables that cause the Quebec data to have a different process from the Ontario data must have an impact. So, we could conclude that there is a difference between the two provinces.

Note that this does not depend on, nor imply, that the distributions of the independent variables are different – rather that the relationship between the independent variables and the dependent variable differs between the two.
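A Chow-style test of this kind can be run in statsmodels by building the interaction dummies explicitly and then testing them jointly with wald_test. The data below are simulated; the variable names and coefficients are made up to mirror the Quebec/Ontario example.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated data in which Quebec's price response differs from Ontario's
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({'Price': rng.uniform(5, 15, n),
                   'Ad_Budget': rng.uniform(0, 100, n),
                   'Quebec': rng.integers(0, 2, n)})
df['Sales'] = (400 - 10 * df['Price'] + 5 * df['Ad_Budget']
               - 8 * df['Price'] * df['Quebec'] + rng.normal(0, 20, n))

# Build the interaction dummies explicitly
df['Price_Que'] = df['Price'] * df['Quebec']
df['Ad_Que'] = df['Ad_Budget'] * df['Quebec']

model = ols('Sales ~ Price + Ad_Budget + Price_Que + Ad_Que', df).fit()

# Chow-style joint test: are the Quebec terms jointly zero?
result = model.wald_test('(Price_Que = 0), (Ad_Que = 0)')
print(result)
```

Rejecting this joint null says the two provinces follow different processes, without telling you which interaction term carries the difference.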

14.10 Testing for Heteroscedasticity

So far, all of the tests we have reviewed were tests of means – a straightforward extension of the tests you have seen in our previous text. Heteroscedasticity is a commonly occurring data problem that violates the assumptions of the OLS model. Left untreated, heteroscedasticity renders hypothesis testing unreliable and makes the parameter estimates inefficient. Each of these problems can be corrected if one is aware of them.

Heteroscedasticity, you may recall, occurs when the variance of the error term is a function of one or more of the x-variables. To detect this you essentially need an estimate of the error – which you can obtain from the residuals of a regular OLS regression. When heteroscedasticity is present, the residual plot will show a pattern such as a cone shape.

Once you have these estimated errors, you build a model to attempt to explain the variance of the error function using one or more of the x-variables, their squares and cross products. In some cases, you may have a strong intuition about what causes the heteroscedasticity; other times you do not know so you try everything that could matter. The choice of variables to include in that model determines the type of test.

Two standard tests are White’s test and the Breusch-Pagan test. Both are based on the Chi-squared distribution, and both are available in modern econometrics software packages, so you get a choice.

The White test is a broader test that includes every variable that could possibly matter for heteroscedasticity, which includes all the variables, their squares and cross products. The problem with the White test is that it is very broad and therefore lower in power than a more focused test. It also turns out that it simultaneously tests for the possibility of a range of model misspecifications.

The Breusch-Pagan test is generally restricted to a smaller set of variables that can predict the pattern of heteroscedasticity. As a more focused test, it tends to be more powerful than the White test because it estimates fewer parameters.

One can test for heteroscedasticity in Python using either White’s or the Breusch-Pagan test on a regression object. The null hypothesis is that of homoscedasticity. To select individual parameters in the Breusch-Pagan test, replace model.model.exog with the variables you would like to test.

from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.diagnostic import het_white
import pandas as pd

white_test = het_white(model.resid,  model.model.exog)

bp_test = het_breuschpagan(model.resid, model.model.exog)

labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, bp_test)))
## {'LM Statistic': 112.85160052787724, 'LM-Test p-value': 2.325443997967602e-26, 'F-Statistic': 126.95271433036112, 'F-Test p-value': 8.371493775870229e-28}
print(dict(zip(labels, white_test)))
## {'LM Statistic': 113.02363779428859, 'LM-Test p-value': 2.865687904724442e-25, 'F-Statistic': 63.52174177487911, 'F-Test p-value': 1.0818199364818569e-26}

Chapter 15 Practice Problems

Now, using all you know of data cleaning, modelling, and regression, work through these sample problems to practice your skills.

15.1 Subway Riders

After another frustrating encounter with the TTC you have begun collecting data on ridership levels in other cities around the world. You have collected information like ridership, cost of a ticket, cost of alternatives (i.e., parking), and how many stations the subway services. Using the data on the ‘Subway’ tab, you hope to produce some evidence you can use to justify your views on the TTC.

15.1.0.1 a. Prepare the data for building a regression model

Hint 1: You will need a dummy variable

15.1.0.2 b. Build a model to predict ridership in each city

Hint 1: How do I get started?

Hint 2: OK, I have my first model - now what?

15.1.0.3 c. Explain how you know this model is ‘good’.

15.1.0.4 d. How many riders would you estimate in a city in Europe with population 2,200,000, a price per trip of $4.00, parking costs of $30 and 15 stations?

Hint 1: How do I get started?

Hint 2: Just give me the answer

15.2 Ice Cream Sales

You have recently been hired as the new data scientist at Toronto’s newest Ice Cream parlor. The owner wants you to test a few hypotheses around how sales change throughout the year. Use the data on the ‘Ice Cream’ tab to test the following theories. Each section of the question should be answered separately with a different test.

15.2.0.1 a. The owner believes that temperature has a positive impact on ice cream sales. Can you confirm their hypothesis and the magnitude of the impact?

Hint 1: How do I get started?

Hint 2: How do I interpret the results?

15.2.0.2 b. The owner believes there is an increase in sales during the summer. Can you confirm their hypothesis and the magnitude of the impact?

Hint 1: How do I get started?

Hint 2: How do I capture the summer effect?

Hint 3: How do I interpret the results?

15.2.0.3 c. The owner also believes that as the summer progresses sales increase (i.e., sales are higher in the second week than they are in the first week). Can you confirm their hypothesis and the magnitude of the impact?

Hint 1: How do I get started?

Hint 2: How do I capture the new summer effect?

Hint 3: How do I interpret the results?

15.3 Some data cleaning practice

15.3.0.1 a. Clean the data on tab A. Make sure it is ready for modelling.

Hint 1: How do I get started?

Hint 2: Just give me the answer

15.3.0.2 b. Using the same data, produce a visualization of balance and income.

15.3.0.3 c. Run a regression including X1-X3 using the data on tab B. Identify a data problem.

Hint 1: How do I get started?

Hint 2: What plot/test do I need?

Glossary

Term Definition
AER (Average Econometric Regression) A modelling philosophy grounded in theory, suggesting that an inaccurate model requires more complex modelling techniques as opposed to a different specification
Analytics A problem solving tool; any time one uses data to provide insight into a business or problem, or inform action.
ARIMA (Autoregressive Integrated Moving Average) AutoRegressive Integrated Moving Average, which is a class of model that captures a suite of different standard temporal structures in time series data.
Artificial Intelligence Intelligence displayed by machines.
Autocorrelation The correlation of a variable with a delayed copy of itself to assess if a correlation in behavior exists. For example, this can be used in assessing stock prices.
Big Data A term that refers to data sets which are too large or complex to be analyzed by traditional data processing applications.
Chi Distribution A continuous probability distribution, often used for evaluating proportions.
Coefficient (Beta) A measure to compare the strength of impact each independent variable has on a dependent variable.
Collinearity A condition where some independent variables are highly correlated.
Critical Data Elements The data elements most important in enabling a business to meet regulatory, operational and business obligations
Cross Sectional Data Data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.
Dashboard A graphical user interface that provides a point-in-time view of key performance indicators.
Data A set of qualitative or quantitative values about subjects.
Data Governance Organizational programs and reporting structures to monitor data characteristics, ensuring it is fit for use.
Data Management Disciplines associated with managing data as a valuable asset within an organization.
Data Privacy The relationship between collection, use and dissemination of data collected about individuals. For example, customers and employees.
Data Science A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data Steward A data governance role within an organization responsible for executing data management programs.
Data Wrangling Refers to the transformation of raw data into a format more appropriate for downstream purposes, such as analytics.
Deep Learning A series of machine learning methods based on artificial neural networks which mimic a decision-making process.
Dependent Variable The output or outcome whose variation is being studied.
Descriptive Analytics Backward looking analytics often used to describe the past. Examples include reporting, dashboards, charts.
Diagnostic Analytics Explorative analytics used to understand the relationship between variables in order to diagnose cause of an outcome. Examples include correlations, regressions.
Dummy Variable A numerical variable used in regression analysis to represent subgroups of the sample in a study. For example, it can be applied to differentiate Canadians vs. Americans. Also known as an indicator.
Econometrics The application of statistical methods to economic data.
Elastic Net A combination of Ridge and Lasso regression.
External Data Data collected from sources external to an organization which cannot be otherwise gathered.
F Distribution A right-skewed distribution used in analysis of variance. It is used to compare statistical models that have been fitted to a data set to identify the model that best fits the population.
Feature A distinctive attribute.
Fragility Analysis (FA) A modelling philosophy suggesting that one should focus on the variables one actually cares about, and on how stable the coefficients for those variables are across different model specifications.
GDPR General Data Protection Regulation is a European Union (EU) regulation that dictates organizations’ accountability for the collection, use, dissemination and deletion of personal information and data about EU citizens.
GLS Generalized Least Squares a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in a regression model.
Hadoop A collection of open-source software which uses multiple computing mechanisms to solve problems using big data.
Heteroscedasticity A characteristic of data that leads to an increase or decrease in the variability of the error term of a regression.
IDE An integrated development environment is a software application that provides comprehensive facilities for building, editing and de-bugging code. For example, Atom.
Independent Variable The input which is assessed for potentially causing a variation in an output.
Interaction Dummy A classic dummy variable (or indicator) multiplied by a continuous variable. Allows for slope changes between groups in a regression.
Internal Data Data recorded or created and stored within an organization for purposes of meeting operational needs and delivering products / services.
Lasso Regression Least Absolute Shrinkage and Selection Operator. A type of linear regression which penalizes the absolute size of coefficients, shrinking some exactly to zero.
Linear Regression A linear approach to modeling the relationship between independent variables and a dependent variable.
Machine Learning Algorithms and statistical models used by computer systems to perform a specific task without receiving explicit instructions, relying on data patterns instead. A subset of artificial intelligence.
MAR (Missing at Random) Missing values are not completely random, but the missingness can be accounted for by variables with complete information.
MCAR (Missing Completely at Random) Values are missing at random and independent of observable or non-observable parameters.
MNAR (Missing Not at Random) The value of the missing variable is related to the reason it is missing.
Multiple Imputation The process of replacing missing data with substituted values.
Natural Language Processing A sub-field of artificial intelligence concerned with the interactions between computers and human language. For example, interpreting phone conversations.
NLP (Natural Language Processing) A sub-field of artificial intelligence concerned with the interactions between computers and human language. For example, interpreting phone conversations.
Nominal Data Data used to label variables without a quantitative value. For example, gender and location.
Non Normal Error Term Any time the error term of your regression does not follow a normal distribution. This violates an underlying assumption of OLS.
Normal Distribution A bell-curved distribution where the majority of values fall within 1 standard deviation of the mean.
NoSQL Not Only SQL; an alternative to traditional relational databases, where data is placed in tables and the data schema is designed before the database is built. This is useful for working with large databases.
OLS Ordinary Least Squares is a type of linear least squares method for estimating the unknown parameters in a linear regression model.
Ordinal Data Data organized in accordance with a priority category, in a scaled manner. For example, satisfaction ratings.
Panel Data Multi-dimensional data collected over a period of time for individual entities. For example, income and age recorded for an individual over 15 years.
PIPEDA Personal Information Protection and Electronic Documents Act is a Canadian law which dictates private sector organizations’ accountability for the collection, use, dissemination and deletion of personal information about Canadian citizens.
Predictive Analytics Forward looking analytics used to predict an unknown outcome based on past data.
Prescriptive Analytics Forward looking analytics used to recommend the best possible outcome based on past data.
Python A programming language for statistical analysis and visualizations.
R A programming language for statistical analysis and visualizations.
Ratio Data Quantitative data with equal, meaningful intervals between values and an absolute “zero” treated as the point of origin. For example, height and weight.
Ridge Regression A type of linear regression aimed at decreasing model complexity without removing variables from the model. Ridge regression shrinks variable coefficients toward 0.
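To show that shrinkage concretely, here is a small NumPy sketch (hypothetical data with two highly correlated predictors) using the closed-form ridge solution, where `alpha` is the penalty strength:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the second predictor nearly duplicates the first.
X = rng.normal(size=(50, 2))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)
y = X[:, 0] + X[:, 1]

def ridge(X, y, alpha):
    # Closed-form ridge solution: (X'X + alpha * I)^-1 X'y.
    # With alpha = 0 this reduces to plain OLS.
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

ols_coef = ridge(X, y, alpha=0.0)
ridge_coef = ridge(X, y, alpha=10.0)

# The penalty pulls the coefficient vector toward zero without
# dropping either variable.
print(np.linalg.norm(ridge_coef) < np.linalg.norm(ols_coef))  # True
```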
RStudio An open-source integrated development environment for R.
Schema A framework to organize and interpret information.
SCR (Situation, Complication, Resolution) A storytelling structure for documents and presentations.
Storyboard Planning the development of a document, for example a PowerPoint deck, by outlining key sections and messages.
Structured Data Data that resides in a fixed field within a record, with a pre-determined format. For example, financial transaction data.
Structured Query Language (SQL) A domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system.
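For a concrete taste of SQL, Python's built-in `sqlite3` module accepts standard SQL statements; the table and values below are hypothetical:

```python
import sqlite3

# Create an in-memory relational database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("apples", 3.50), ("bread", 2.25), ("apples", 4.00)],
)

# A typical SQL query: total sales per product.
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('apples', 7.5), ('bread', 2.25)]
conn.close()
```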
Time Series Data Data points indexed in time order. For example, daily, monthly or annually.
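A short pandas sketch (hypothetical daily sales figures) shows what "indexed in time order" enables, such as resampling to a coarser frequency:

```python
import pandas as pd

# Hypothetical daily sales indexed by date, in time order.
dates = pd.date_range("2023-01-01", periods=6, freq="D")
sales = pd.Series([10, 12, 9, 15, 11, 13], index=dates)

# Resample the daily series into 3-day totals.
three_day_totals = sales.resample("3D").sum()
print(three_day_totals.tolist())  # [31, 39]
```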
TTT (Test Test Test) A modelling philosophy largely grounded in testing variable significance. The philosophy ‘tests down’ from larger models.
Unstructured Data Data that does not follow a pre-determined format. For example, recorded phone conversations in a call centre.
Variety The number of types of data in a database or file.
Velocity The speed of data processing.
Veracity The degree to which data is accurate, precise and reliable.
Volume The amount of data in a database or file.

Reinsel, David, John Rydning, and John Gantz. 2018. “The Digitization of the World: From Edge to Core.” https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf.

Kennedy, Peter. 2008. A Guide to Econometrics. 6th ed.

Piatetsky, Gregory. 2018. “Python Eats Away at R: Top Software for Analytics, Data Science, Machine Learning in 2018: Trends and Analysis.” https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html.